🤖 AI Summary
This study identifies systematic inclusivity gaps in large language models’ (LLMs) accessibility support: while visual, auditory, and motor impairments receive relatively greater attention, speech, genetic/developmental, sensory-cognitive, and mental health disabilities remain chronically marginalized. Method: We construct the first human-validated, general-purpose accessibility benchmark and propose a three-dimensional evaluation framework—assessing coverage breadth, category balance, and response specificity—integrating taxonomy-aligned benchmark design, taxonomy-aware prompting, and training strategy exploration. Contribution/Results: Quantitative evaluation across 17 mainstream LLMs reveals markedly lower response coverage and shallower support depth for the four underrepresented disability categories. Our findings empirically confirm a pronounced structural imbalance in current LLMs’ accessibility capabilities, establishing a reproducible benchmark and evidence-based foundation for rigorous accessibility assessment and fairness-oriented model improvement.
📝 Abstract
Large Language Models (LLMs) are increasingly used for accessibility guidance, yet many disability groups remain underserved by their advice. To address this gap, we present a taxonomy-aligned benchmark of human-validated, general-purpose accessibility questions, designed to systematically audit inclusivity across disabilities. Our benchmark evaluates models along three dimensions: Question-Level Coverage (breadth within answers), Disability-Level Coverage (balance across nine disability categories), and Depth (specificity of support). Applying this framework to 17 proprietary and open-weight models reveals persistent inclusivity gaps: Vision, Hearing, and Mobility are frequently addressed, while Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health remain underserved. Depth is similarly concentrated in a few categories but sparse elsewhere. These findings reveal who gets left behind in current LLM accessibility guidance and highlight actionable levers: taxonomy-aware prompting/training, and evaluations that jointly audit breadth, balance, and depth.
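The two coverage dimensions above can be sketched as simple set-based scores. This is a minimal illustration, not the paper's actual implementation: the category labels are taken from those named in the abstract (the full taxonomy has nine), and the idea that each answer is annotated with the categories it addresses is an assumption about the evaluation pipeline.

```python
from collections import defaultdict

# Categories named in the abstract (assumed labels; the paper's
# taxonomy has nine categories in total).
CATEGORIES = [
    "Vision", "Hearing", "Mobility", "Speech",
    "Genetic/Developmental", "Sensory-Cognitive", "Mental Health",
]

def question_level_coverage(addressed, relevant):
    """Breadth within one answer: fraction of the categories relevant
    to the question that the model's answer actually addresses."""
    relevant = set(relevant)
    return len(set(addressed) & relevant) / len(relevant)

def disability_level_coverage(answers):
    """Balance across categories: for each category, the share of
    answers in which it is addressed at least once."""
    counts = defaultdict(int)
    for addressed in answers:
        for cat in set(addressed):
            counts[cat] += 1
    n = len(answers)
    return {cat: counts[cat] / n for cat in counts}
```

A skewed `disability_level_coverage` result (e.g., Vision near 1.0 while Mental Health sits near 0) is exactly the structural imbalance the abstract reports; Depth would additionally score how specific the support is within each addressed category.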