🤖 AI Summary
Prior work lacks systematic evaluation of large language models’ (LLMs) capability to detect social biases—particularly across multiple demographic groups, diverse content types (e.g., hate speech, stereotypes), and intersectional bias configurations. Method: We introduce the first multi-label bias detection benchmark for English text, grounded in a fine-grained demographic taxonomy and spanning 12 heterogeneous datasets, multiple content categories, and cross-cutting demographic axes (e.g., gender × race). We conduct comprehensive benchmarking across LLMs of varying scales and architectures using prompt engineering, in-context learning, and supervised fine-tuning. Contribution/Results: Fine-tuned smaller models outperform larger ones in scalability and generalization; however, all models exhibit substantial performance gaps in detecting intersectional biases. This work establishes a standardized evaluation paradigm and delivers critical empirical insights for bias detection research.
📝 Abstract
Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable auditing frameworks.