Are Models Biased on Text without Gender-related Language?

ICLR 2024

University of California, Irvine

Abstract

We introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score, based on pretraining data statistics, that determines whether a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 evaluated models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation.

Background

Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations, and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings?

UnStereoEval

UnStereoEval evaluates the gender fairness of language models using non-stereotypical English sentences. It builds on word-level pretraining statistics to quantify the gender correlation of each word, and thereby ensures minimal gender correlations within sentences. Because we remove gender correlations present in the pretraining data, we hypothesize that language models trained on the same data should assign (approximately) equal probability mass to minimally distant gendered sentences.
The framework can be used to filter sentences from gender-invariant datasets, such as Winobias (WB) or Winogender (WG), and thus restrict the evaluation to a subset with minimal gender correlations. Doing so, however, may lead to considerably smaller datasets, since these datasets were created to surface stereotypical biases (e.g., gender-occupation, gender-emotion) known to be pervasive in training data.
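As a minimal sketch of how such a score and filter could be implemented (not the paper's exact formulation): it assumes PMI-style co-occurrence counts between each word and gendered pronouns are available as Counter objects, and the log-ratio skew, the max-over-words aggregation, and the threshold value are illustrative choices.

import math
from collections import Counter

def gender_skew(word: str, male_cooc: Counter, female_cooc: Counter,
                smoothing: float = 1.0) -> float:
    """Log-ratio of a word's co-occurrence with male vs. female pronouns
    in the pretraining corpus. Values near 0 mean the word carries little
    gender association; a large |skew| means a strong association."""
    m = male_cooc[word] + smoothing
    f = female_cooc[word] + smoothing
    return math.log2(m / f)

def sentence_gender_score(sentence: str, male_cooc: Counter,
                          female_cooc: Counter) -> float:
    """One possible sentence-level score: the largest absolute word-level skew."""
    words = sentence.lower().split()
    return max(abs(gender_skew(w, male_cooc, female_cooc)) for w in words)

def is_stereotype_free(sentence: str, male_cooc: Counter,
                       female_cooc: Counter, threshold: float = 0.5) -> bool:
    """Keep only sentences whose every word has near-zero gender correlation.
    The threshold is illustrative, not the value used in the paper."""
    return sentence_gender_score(sentence, male_cooc, female_cooc) < threshold

Under these assumptions, restricting an existing benchmark to its low-correlation subset is a one-line filter, e.g. [s for s in winobias_sentences if is_stereotype_free(s, male_cooc, female_cooc)], where winobias_sentences is a hypothetical list of benchmark sentences.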

Creation of non-stereotypical benchmarks

The reliable assessment of fairness in stereotype-free scenarios requires varied, natural-sounding sentences. To address this requirement, we develop an automated, model-based pipeline to create non-stereotypical English sentences. The pipeline consists of two parts, illustrated below:
[Figure: the UnStereoEval benchmark-creation pipeline.]
The pipeline can be repeated multiple times until a desired benchmark size is achieved.
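A rough sketch of this repeat-until-full loop is given below; generate_candidates and passes_use_filter are placeholder callables standing in for the two pipeline stages, and are assumptions rather than the paper's actual implementation.

from typing import Callable, List

def build_benchmark(generate_candidates: Callable[[int], List[str]],
                    passes_use_filter: Callable[[str], bool],
                    target_size: int,
                    batch_size: int = 256,
                    max_rounds: int = 100) -> List[str]:
    """Alternate candidate generation and USE filtering until the benchmark
    reaches the target size (or the round budget runs out)."""
    benchmark: List[str] = []
    seen = set()
    for _ in range(max_rounds):
        for sentence in generate_candidates(batch_size):
            if sentence in seen:  # drop duplicates across rounds
                continue
            seen.add(sentence)
            if passes_use_filter(sentence):
                benchmark.append(sentence)
                if len(benchmark) >= target_size:
                    return benchmark
    return benchmark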

Evaluation using UnStereoEval

Ideally, a model evaluated on non-stereotypical sentences should exhibit no bias towards either gender, especially when evaluated on datasets that impose stricter gender co-occurrence constraints. We capture this intuition with the Unstereo Score metric, which measures the percentage of examples for which the language model exhibits no gender preference. An unbiased model should have an Unstereo Score of 100%, i.e., it assigns roughly the same probability mass to both the male and female completions.
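A minimal sketch of this metric, assuming we already have the model's log-probabilities for the male and female variants of each sentence pair; the tolerance eps that decides when two probabilities count as "roughly equal" is illustrative.

import math
from typing import List, Tuple

def unstereo_score(logp_pairs: List[Tuple[float, float]],
                   eps: float = math.log(1.2)) -> float:
    """Percentage of pairs for which the model shows no gender preference,
    i.e. the two gendered variants receive (approximately) equal probability
    mass. `logp_pairs` holds (logp_male, logp_female) tuples; `eps` is an
    illustrative tolerance on the log-probability gap. 100% corresponds to a
    fully unbiased model under this criterion."""
    fair = sum(1 for lp_m, lp_f in logp_pairs if abs(lp_m - lp_f) <= eps)
    return 100.0 * fair / len(logp_pairs)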

In addition to Unstereo Score, we measure a model's tendency to systematically assign higher probability mass to one gender over the other. This is quantified by the gender-based Preference Disparity metric, which measures the percentage of preference pairs for which the model assigns higher probability mass to the male completion. A model with no preference disparity should have a score of 50%: among the pairs for which the model exhibits a preference, it prefers the female completion in half of them and the male completion in the other half.
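Under the same assumptions (per-pair log-probabilities and an illustrative tolerance eps for deciding that a preference exists), the disparity metric can be sketched as follows.

import math
from typing import List, Tuple

def preference_disparity(logp_pairs: List[Tuple[float, float]],
                         eps: float = math.log(1.2)) -> float:
    """Among the pairs where the model does prefer one gendered variant
    (log-probability gap larger than `eps`), the percentage where the
    preferred variant is the male one. 50% means no systematic skew
    toward either gender."""
    prefers_male = [lp_m > lp_f
                    for lp_m, lp_f in logp_pairs
                    if abs(lp_m - lp_f) > eps]
    if not prefers_male:
        return 50.0  # the model never expressed a preference
    return 100.0 * sum(prefers_male) / len(prefers_male)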

Metrics:

Benchmark Size:

Benchmark Examples:

BibTeX

@inproceedings{belem2024-unstereoeval,
  title={Are Models Biased on Text without Gender-related Language?},
  author={Catarina G Bel{\'e}m and Preethi Seshadri and Yasaman Razeghi and Sameer Singh},
  month={May},
  year={2024},
  booktitle={The Twelfth International Conference on Learning Representations},
  url={https://openreview.net/forum?id=w1JanwReU6}
}