🤖 AI Summary
This study exposes systemic ableist bias in large language models (LLMs) in hiring contexts, examining how intersecting marginalizations, such as gender and caste in Global South settings, compound discrimination against people with disabilities (PwD). We introduce ABLEIST, an evaluation framework for disability-related bias grounded in disability studies theory, comprising five ableism-specific harm metrics and three intersectional harm metrics. Using 2,820 diverse hiring prompts, we conduct a large-scale audit of six state-of-the-art LLMs. Results show that the models significantly amplify negative stereotyping of PwD candidates, with the most severe harms concentrated among multiply marginalized groups (e.g., disabled women from historically oppressed castes). Critically, existing safety classifiers show near-zero detection of these sociotechnical harms. This work provides a systematic, quantitative analysis of intersectional ableism in Global South contexts, contributing a new benchmark and methodology for evaluating LLM fairness.
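To make the audit design concrete, below is a minimal sketch of how an intersectional hiring-prompt grid over the four identity axes named in the paper could be constructed. The specific attribute values, template wording, and resulting prompt count are illustrative assumptions, not the paper's actual materials.

```python
from itertools import product

# Hypothetical attribute values for each identity axis; the paper's exact
# value sets are not reproduced here.
disabilities = ["no disability", "blindness", "cerebral palsy", "chronic illness"]
genders = ["woman", "man", "non-binary person"]
castes = ["dominant-caste", "oppressed-caste"]
nationalities = ["Indian", "American"]

# Illustrative hiring-scenario template; the real study's prompts differ.
TEMPLATE = (
    "You are screening resumes for a software engineer role. "
    "Evaluate this candidate: a {gender} from a {caste} background, "
    "{nationality}, with {disability}. Summarize their fit for the role."
)

# Cross all axes to produce one prompt per identity profile.
prompts = [
    TEMPLATE.format(disability=d, gender=g, caste=c, nationality=n)
    for d, g, c, n in product(disabilities, genders, castes, nationalities)
]
print(len(prompts), "audit prompts generated")
```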
📝 Abstract
Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization--such as gender and caste--shape the experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in the disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates--harms that many state-of-the-art safety classifiers failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender- and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains such as hiring.
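As a rough illustration of the kind of analysis the abstract describes, the sketch below aggregates per-response harm annotations into rates per model, identity profile, and harm category, then computes the gap between two candidate profiles. The record format, profile labels, and the source of the binary `flagged` annotations are assumptions for illustration; the paper's actual annotation pipeline is not shown here.

```python
from collections import defaultdict

def harm_rates(records):
    """Compute flagged-response rates per (model, profile, harm category)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [num_flagged, num_total]
    for model, profile, category, flagged in records:
        key = (model, profile, category)
        counts[key][0] += int(flagged)
        counts[key][1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}

# Toy records: (model, identity_profile, harm_category, flagged).
records = [
    ("model-a", "disabled oppressed-caste woman", "Tokenism", True),
    ("model-a", "disabled oppressed-caste woman", "Tokenism", True),
    ("model-a", "non-disabled dominant-caste man", "Tokenism", False),
    ("model-a", "non-disabled dominant-caste man", "Tokenism", False),
]

rates = harm_rates(records)
# Gap in Tokenism rates between the two profiles under one model.
gap = (rates[("model-a", "disabled oppressed-caste woman", "Tokenism")]
       - rates[("model-a", "non-disabled dominant-caste man", "Tokenism")])
print(f"Tokenism rate gap: {gap:+.2f}")
```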