MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind

📅 2025-06-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of detecting implicit bias in large language models (LLMs), which conventional direct-query methods largely miss because they are susceptible to social desirability effects. The authors propose a multi-dimensional implicit bias evaluation framework that combines the Stereotype Content Model (SCM) with Theory of Mind (ToM). The framework decomposes bias along three dimensions (Competence, Sociability, and Morality) and uses two indirect probes designed not to trigger model avoidance: the Word Association Bias Test (WABT) and the Affective Attribution Test (AAT). Experiments on eight state-of-the-art LLMs reveal three patterns: pervasive sociability bias, divergence of bias across dimensions, and asymmetric stereotype amplification. The authors present the framework as a more robust methodology for identifying the structural nature of implicit bias in LLMs.

📝 Abstract

Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.

Problem

Research questions and friction points this paper is trying to address.

Assessing multi-dimensional implicit bias in LLMs via Theory of Mind

Overcoming limitations of direct-query methods in bias evaluation

Measuring latent stereotypes without triggering model avoidance behaviors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Stereotype Content Model for multi-dimensional bias evaluation

Introduces Word Association Bias Test for implicit lexical associations

Uses Affective Attribution Test to measure covert affective leanings
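
To make the idea of a per-dimension bias comparison concrete, the sketch below aggregates hypothetical word-association outcomes along the three SCM dimensions. It is an illustration only: the record format `(dimension, group, valence)`, the ±1 valence coding, and the mean-difference gap are assumptions made here, not the paper's actual WABT or AAT scoring.

```python
from collections import defaultdict

# The three dimensions named in the abstract.
DIMENSIONS = ("competence", "sociability", "morality")


def mean_valence(records):
    """Average valence per (dimension, group).

    records: iterable of (dimension, group, valence) tuples, where valence
    is +1 if the model picked a positive word for the group and -1 if it
    picked a negative one. This coding is an assumption for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # (dimension, group) -> [sum, count]
    for dimension, group, valence in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        if valence not in (1, -1):
            raise ValueError(f"valence must be +1 or -1, got {valence!r}")
        entry = totals[(dimension, group)]
        entry[0] += valence
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def dimension_gaps(records, group_a, group_b):
    """Mean-valence difference (group_a minus group_b) for each dimension.

    Dimensions with no records for either group are left out, so a missing
    key means "not measured", not "no bias".
    """
    means = mean_valence(records)
    gaps = {}
    for dimension in DIMENSIONS:
        a = means.get((dimension, group_a))
        b = means.get((dimension, group_b))
        if a is not None and b is not None:
            gaps[dimension] = a - b
    return gaps


if __name__ == "__main__":
    # Hypothetical outcomes for two placeholder groups, "A" and "B".
    sample = [
        ("competence", "A", 1), ("competence", "A", 1),
        ("competence", "B", -1), ("competence", "B", 1),
        ("sociability", "A", -1), ("sociability", "B", 1),
    ]
    print(dimension_gaps(sample, "A", "B"))
```

Reporting one gap per dimension, rather than a single pooled number, mirrors the multi-dimensional framing above: a model can favour one group on competence while disfavouring it on sociability, and a pooled score would hide that divergence.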
