🤖 AI Summary
Current fairness evaluations of large language models predominantly focus on stereotype associations, offering limited capacity to anticipate bias risks in real-world scenarios. This work proposes a novel paradigm centered on “sensitive prompts,” introducing the SensY dataset—a multi-domain collection of 12,801 prompts that integrates both synthetic and real user data—and establishes prompt sensitivity as an early-warning signal for bias. We develop a classifier that achieves strong performance in automatically identifying sensitive prompts, and empirically demonstrate that, despite generating factually accurate responses, mainstream open-source models frequently overlook the ethical and contextual implications of sensitive queries. By enabling proactive identification of potentially harmful inputs, our approach shifts fairness assessment from reactive mitigation toward preventive intervention.
📝 Abstract
Large Language Models (LLMs) are being increasingly integrated into software systems, offering powerful capabilities but also raising concerns about fairness. Existing fairness benchmarks, however, focus on stereotype-specific associations, which limits their ability to anticipate risks in diverse, real-world contexts. In this paper, we propose sensitive prompts as a new abstraction for fairness evaluation: inputs that are not inherently biased but are more likely to elicit biased or inadequate responses due to the sensitivity of their content. We construct and release SensY, a dataset of 12,801 prompts, labeled as sensitive or non-sensitive, spanning seven thematic domains and combining synthetic generation with real user inputs. Using this dataset, we query three open-source LLMs and manually analyze 4,500 responses to evaluate their adequacy in answering sensitive prompts. Results show that while models often provide factually correct answers, they frequently fail to acknowledge the ethical, relational, or contextual implications of sensitive inputs. In addition, we develop an automated classifier for predicting prompt sensitivity, achieving robust performance on sensitive prompts. Our findings demonstrate that prompt sensitivity can serve as an effective early-warning mechanism for fairness risks in LLMs. This perspective shifts fairness assessment from reactive mitigation toward preventive design, enabling developers to anticipate and manage bias before deployment.
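To make the early-warning idea concrete, the sketch below shows how a sensitivity gate could sit in front of an LLM, routing flagged prompts to extra handling before any response is generated. This is a minimal illustration only: the cue list, scoring rule, and threshold are hypothetical stand-ins, not the paper's actual classifier or the SensY taxonomy.

```python
# Hypothetical early-warning gate for prompt sensitivity.
# The cue set and 0.05 threshold are illustrative assumptions,
# not taken from the SensY paper.

SENSITIVE_CUES = {
    "religion", "ethnicity", "gender", "disability", "salary",
    "immigration", "addiction", "diagnosis", "political",
}

def sensitivity_score(prompt: str) -> float:
    """Fraction of the prompt's words that match a sensitive cue."""
    words = [w.strip(".,?!").lower() for w in prompt.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SENSITIVE_CUES)
    return hits / len(words)

def is_sensitive(prompt: str, threshold: float = 0.05) -> bool:
    """Gate: flag prompts whose score meets the threshold for review."""
    return sensitivity_score(prompt) >= threshold

# Flagged prompts would be routed to context-aware handling
# before the model answers, instead of being filtered afterward.
print(is_sensitive("Should I mention my disability in a job interview?"))
print(is_sensitive("What is the capital of France?"))
```

In a real pipeline, the keyword lookup would be replaced by a learned classifier (such as the one described above), but the preventive placement of the check, before generation rather than after, is the point the abstract argues for.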