🤖 AI Summary
Problem: Existing intrinsic bias benchmarks, such as SEAT and CAP, yield inconsistent measurement outcomes because they implicitly conflate distinct sociopsychological dimensions of gender stereotyping (e.g., role assignment, trait attribution, competence expectations) rather than quantifying a single, unitary bias construct.
Method: We propose a data distribution alignment framework guided by social-psychological theory, enabling cross-benchmark calibration so that intrinsic bias measurements become comparable and interpretable.
Contribution/Results: Evaluated on two major intrinsic benchmarks, our method significantly improves measurement consistency and explanatory power. It provides the first systematic empirical validation that intrinsic bias assessment is inherently multidimensional. Furthermore, it establishes a finer-grained, theoretically grounded framework for bias evaluation, advancing both methodological rigor and conceptual clarity in language model fairness research.
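To make the component dimensions above concrete, here is a minimal sketch of how benchmark items could be tagged with the sociopsychological components the summary names. The keyword lexicons and the `label_components` helper are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch: tag benchmark sentences with the stereotype components
# named in the summary (role assignment, trait attribution, competence
# expectations). The lexicons below are toy examples for illustration only.

COMPONENT_LEXICONS = {
    "role_assignment": {"nurse", "engineer", "secretary", "ceo", "homemaker"},
    "trait_attribution": {"emotional", "aggressive", "gentle", "rational"},
    "competence_expectation": {"brilliant", "incompetent", "capable", "weak"},
}

def label_components(sentence: str) -> set[str]:
    """Return the stereotype components a benchmark sentence touches."""
    tokens = set(sentence.lower().split())
    return {name for name, lexicon in COMPONENT_LEXICONS.items()
            if tokens & lexicon}

# A single benchmark item can hit more than one component, which is one way
# two benchmarks with different item mixes can disagree on "the" bias score.
print(label_components("The engineer was brilliant at her job."))
# e.g. {'role_assignment', 'competence_expectation'}
```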
📝 Abstract
The multifaceted challenge of accurately measuring gender stereotypical bias in language models is akin to discerning different parts of a larger, unseen whole. This short paper focuses on intrinsic bias mitigation and measurement strategies for language models, building on prior research showing a lack of correlation between intrinsic and extrinsic approaches. We delve deeper into intrinsic measurements, identify inconsistencies among them, and suggest that these benchmarks may reflect different facets of gender stereotyping. Our methodology involves analyzing data distributions across datasets and integrating gender stereotype components informed by social psychology. By adjusting the distributions of the two datasets, we achieve better alignment of their outcomes. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
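The distribution-adjustment step the abstract describes could look roughly like the following: resample one benchmark so its mix of stereotype components matches another's, then compare bias scores on the aligned data. The dataset format, the `component` field, and the `score_bias` function are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of cross-benchmark distribution alignment: make one
# benchmark's distribution over stereotype components match another's before
# comparing their intrinsic bias scores.
import random
from collections import Counter

def align_distribution(source, target, key=lambda item: item["component"]):
    """Resample `source` (with replacement) so its distribution over
    component labels matches the empirical distribution of `target`."""
    target_freq = Counter(key(item) for item in target)
    by_label = {}
    for item in source:
        by_label.setdefault(key(item), []).append(item)
    aligned = []
    for label, count in target_freq.items():
        pool = by_label.get(label)
        if pool:  # skip component labels absent from the source benchmark
            aligned.extend(random.choices(pool, k=count))
    return aligned

# Usage idea: align benchmark_b to benchmark_a's component mix, then check
# whether the two intrinsic bias scores agree more closely than before.
# aligned_b = align_distribution(benchmark_b, benchmark_a)
# print(score_bias(benchmark_a), score_bias(aligned_b))
```

The design choice this illustrates is the paper's central point: if two benchmarks disagree mainly because they sample different stereotype components, equalizing those component distributions should bring their measurements into closer agreement.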