Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a critical gap in evaluating large language models' (LLMs) ability to detect nuanced ableism, that is, subtle, context-dependent discriminatory language targeting autistic individuals, and challenges the flawed assumption that lexical recognition suffices for bias identification. Method: We propose a dual-dimension evaluation framework incorporating contextual framing, speaker identity, and sociocultural impact, validated on a human-annotated dataset, and combine quantitative classification with qualitative interpretability analysis. Contribution/Results: Four state-of-the-art LLMs achieve high accuracy in recognizing autism-related terminology but consistently fail to identify semantic-level ableism, relying heavily on superficial keyword matching rather than deep contextual reasoning. Human and LLM annotators agree on the annotation scheme (Cohen's κ > 0.8), suggesting that binary classification is adequate for this task. Our work challenges the reductive paradigm that equates term understanding with bias detection and establishes a benchmark and methodological foundation for assessing disability-inclusive AI.
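For reference, agreement figures like the κ reported above can be computed directly from paired binary labels. Below is a minimal sketch using scikit-learn; the label arrays are illustrative placeholders, not the paper's data.

```python
# Minimal sketch: measuring human-LLM agreement on binary ableism labels.
# The label arrays below are hypothetical, not the paper's annotations.
from sklearn.metrics import cohen_kappa_score

# 1 = ableist, 0 = not ableist
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
llm_labels   = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # values >= 0.8 are conventionally read as strong agreement
```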

📝 Abstract
Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.
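To make the failure mode concrete, here is a hypothetical keyword-baseline classifier of the kind the abstract argues LLMs effectively reduce to. The term list and example sentences are constructed for illustration, not taken from the paper's dataset.

```python
# Hypothetical keyword baseline: flags any text mentioning an autism-related
# term. This mirrors the surface-level matching behavior described above;
# the term list and examples are illustrative, not the paper's materials.
AUTISM_TERMS = {"autistic", "autism", "on the spectrum"}

def keyword_baseline(text: str) -> bool:
    """Return True if the text mentions any autism-related term."""
    lowered = text.lower()
    return any(term in lowered for term in AUTISM_TERMS)

# Both sentences mention autism, but only the second uses it as an insult.
neutral = "My coworker is autistic and gave a great talk today."
ableist = "He forgot his keys again, he is so autistic."

print(keyword_baseline(neutral))  # True -> false positive on neutral mention
print(keyword_baseline(ableist))  # True -> right label, but for the wrong reason
```

A keyword match cannot tell these two uses apart; distinguishing them requires exactly the contextual framing, speaker identity, and impact cues that the human annotators relied on.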
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to detect nuanced ableism in text
Assessing the gap between LLMs' terminology understanding and contextual recognition
Comparing human and LLM explanations of ableism classifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLMs on nuanced ableism detection
Comparing human and LLM annotation approaches
Assessing the adequacy of binary classification for ableism evaluation
Naba Rizvi
PhD Student, UCSD
Multimodal AI, NLP

Harper Strickland
University of California, San Diego

Saleha Ahmedi
University of California, San Diego

Aekta Kallepalli
University of California, San Diego

Isha Khirwadkar
University of California, San Diego

William Wu
University of California, San Diego

Imani N. S. Munyaka
UCSD
Usable Security and Privacy, Technology Policy

Nedjma Ousidhoum
Lecturer (Assistant Professor), Cardiff University
Natural Language Processing, Computational Social Science, Machine Learning