🤖 AI Summary
This paper addresses the challenge of quantifying implicit social biases in masked language models (MLMs). We propose an iterative masking experiment that uses prediction-quality proxy functions to measure fine-grained bias by comparing a model's relative predictive preference for sentences associated with disadvantaged versus advantaged social groups. Crucially, the approach uses the change in prediction quality before and after retraining to track how bias evolves, which addresses the insensitivity of conventional benchmarks (e.g., Bias-in-Bios) to bias amplification. Experiments across major MLMs confirm the pervasive presence of significant social biases. Our method achieves a 37% average improvement in bias detection sensitivity over existing benchmarks and aligns strongly with human-annotated bias judgments. The framework thus offers a more robust, interpretable, and dynamically traceable approach to bias assessment in MLMs.
📝 Abstract
Transformer language models have achieved state-of-the-art performance on a variety of natural language tasks but have been shown to encode unwanted biases. We evaluate the social biases encoded by transformers trained with the masked language modeling objective (masked language models, MLMs). Within an iterative masking experiment, we use proposed proxy functions to measure the quality of the models' predictions and to assess the preference of MLMs towards disadvantaged and advantaged groups. We find that all models encode concerning social biases. We compare our bias estimations with those produced by other evaluation methods on benchmark datasets and assess their alignment with human-annotated biases. We extend previous work by evaluating the social biases introduced by retraining an MLM under the masked language modeling objective. The proposed measures, which are based on the relative preference for biased sentences between models, produce more accurate and sensitive estimations of the biases introduced by retraining, while other methods tend to underestimate biases after retraining on sentences biased towards disadvantaged groups.
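To make the idea of an iterative masking experiment concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It masks each token of a sentence in turn, scores the sentence by the mean log-probability of the true tokens, and reports how often a model scores the sentence associated with the disadvantaged group above its advantaged counterpart. The mean log-probability is only an illustrative stand-in for the paper's proxy functions, which the abstract does not define. The names `make_unigram_predictor`, `iterative_mask_score`, and `preference_rate` are hypothetical, and the context-free unigram predictor is a toy replacement for a real MLM.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "[MASK]"  # placeholder mask symbol (assumption, not tied to any tokenizer)

# Hypothetical predictor interface: given tokens with one position masked,
# return a probability distribution over candidate tokens for that position.
Predictor = Callable[[Sequence[str], int], Dict[str, float]]


def make_unigram_predictor(corpus: List[List[str]]) -> Predictor:
    """Toy stand-in for an MLM: ignores context and returns unigram frequencies."""
    counts: Dict[str, int] = {}
    for sentence in corpus:
        for token in sentence:
            counts[token] = counts.get(token, 0) + 1
    total = sum(counts.values())
    dist = {token: n / total for token, n in counts.items()}

    def predict(masked_tokens: Sequence[str], position: int) -> Dict[str, float]:
        return dist

    return predict


def iterative_mask_score(tokens: Sequence[str], predict: Predictor) -> float:
    """Mean log-probability of each true token when it alone is masked."""
    total = 0.0
    for i, true_token in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK
        probs = predict(masked, i)
        # Floor the probability so an unseen token does not produce log(0).
        total += math.log(max(probs.get(true_token, 0.0), 1e-12))
    return total / len(tokens)


def preference_rate(
    pairs: List[Tuple[Sequence[str], Sequence[str]]], predict: Predictor
) -> float:
    """Fraction of (disadvantaged, advantaged) pairs where the first scores higher."""
    wins = sum(
        1
        for disadvantaged, advantaged in pairs
        if iterative_mask_score(disadvantaged, predict)
        > iterative_mask_score(advantaged, predict)
    )
    return wins / len(pairs)


if __name__ == "__main__":
    # Placeholder group tokens; the "retrained" model sees more group_a text.
    pairs = [(["group_a", "is", "late"], ["group_b", "is", "late"])]
    base = make_unigram_predictor([["group_b", "is", "late"]])
    retrained = make_unigram_predictor(
        [["group_b", "is", "late"], ["group_a", "is", "late"], ["group_a", "is", "late"]]
    )
    print("base:", preference_rate(pairs, base))
    print("retrained:", preference_rate(pairs, retrained))
```

Comparing `preference_rate` for the same sentence pairs before and after retraining mirrors, in a simplified way, the relative-preference comparison between models described in the abstract. With a real MLM, `predict` would wrap the model's output distribution at the masked position.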