🤖 AI Summary
This study addresses the challenge of detecting implicit bias in large language models (LLMs), which conventional direct-query methods largely fail to surface because of social desirability effects. It proposes a multi-dimensional evaluation framework that uses the Stereotype Content Model (SCM) to recast implicit bias as a failure of Theory of Mind (ToM). The framework decomposes bias along three dimensions (Competence, Sociability, and Morality) and relies on two indirect probes designed to avoid triggering model avoidance: the Word Association Bias Test (WABT), which assesses implicit lexical associations, and the Affective Attribution Test (AAT), which measures covert affective leanings. Experiments on eight state-of-the-art LLMs reveal three patterns: pervasive sociability bias, divergence in bias across dimensions, and asymmetric stereotype amplification. The authors present the framework as a more robust methodology for identifying the structural nature of implicit bias in LLMs.
📝 Abstract
Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To address this, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on eight state-of-the-art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.