🤖 AI Summary
This work addresses systemic unfairness and bias in large language models (LLMs) when detecting sensitive and offensive statements. It proposes STOP, a benchmark and progressive sensitivity evaluation framework. STOP comprises 450 progressively offensive sequences (2,700 utterances) spanning nine demographic dimensions and 46 subgroups, and is the first benchmark to model bias severity as a continuous spectrum. It introduces fine-grained human annotations, cross-model consistency validation, and a DPO-based fine-tuning method for human preference alignment. Unlike isolated-scenario evaluations, STOP offers substantially greater ecological validity. Experiments show that state-of-the-art LLMs detect offensive statements with success rates of only 19.3%–69.8%. After STOP-aligned fine-tuning, answer rates on sensitive tasks improve by up to 191% with no degradation in generalization performance.
📝 Abstract
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. Furthermore, we demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving overall performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, enabling more effective bias mitigation strategies and facilitating the creation of fairer language models.
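The summary above mentions aligning models with human judgments via DPO-based fine-tuning. As a rough illustration only (not the paper's implementation; `beta` and the log-probabilities here are placeholder values), the per-pair Direct Preference Optimization objective can be sketched as:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    logp_* are total log-probabilities of the human-preferred (chosen)
    and dispreferred (rejected) responses under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    The loss decreases as the policy favors the chosen response more
    strongly, relative to the reference, than the rejected one.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference exactly, the margin is 0,
# so the loss is -log(0.5) = log(2) ≈ 0.693.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

In practice, alignment frameworks optimize this loss over a dataset of (prompt, chosen, rejected) triples; for STOP, the preference pairs would come from human judgments on the offensive progressions.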