🤖 AI Summary
Traditional clinical aphasia assessment tools rely on human-specific pragmatic and cognitive mechanisms, making them incompatible with the text-generation characteristics of large language models (LLMs). To address this gap, we propose TAB, the first text-based aphasia benchmark specifically designed for LLMs. TAB is a semantically and syntactically aligned textual reconstruction of the Quick Aphasia Battery, comprising four subtests and a clinically grounded, quantitative scoring framework. We integrate Gemini 2.5 Flash for fully automated scoring; weighted Cohen's kappa shows that its inter-rater reliability matches that of human experts (κ = 0.255 vs. expert-to-expert κ = 0.286). TAB is open-sourced to enable large-scale, reproducible evaluation of language impairments in LLMs, thereby establishing the first standardized, automated assessment framework for AI-driven computational aphasiology.
📝 Abstract
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model-consensus agreement vs. 0.286 for human-human agreement).  We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
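The reliability claim rests on weighted Cohen's kappa, which credits partial agreement between raters on an ordinal scoring scale. The paper uses a prevalence-weighted variant; as an illustrative sketch only (not the authors' scoring code), the standard linearly weighted form can be computed from two raters' scores like this, where the category count and linear weighting scheme are assumptions for the example:

```python
import numpy as np

def weighted_kappa(rater1, rater2, n_cat, weights="linear"):
    """Weighted Cohen's kappa for two raters' integer scores in {0, ..., n_cat-1}.

    Illustrative sketch with linear (or quadratic) disagreement weights;
    the paper's prevalence-weighted variant adjusts further for category
    frequency and is not reproduced here.
    """
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    # Observed joint distribution of the two raters' labels.
    obs = np.zeros((n_cat, n_cat))
    for a, b in zip(r1, r2):
        obs[a, b] += 1
    obs /= obs.sum()
    # Chance-expected distribution: outer product of the marginals.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Disagreement weights: 0 on the diagonal, growing with score distance.
    idx = np.arange(n_cat)
    dist = np.abs(idx[:, None] - idx[None, :]) / (n_cat - 1)
    w = dist if weights == "linear" else dist ** 2
    return 1.0 - (w * obs).sum() / (w * exp).sum()
```

Perfect agreement yields kappa = 1, chance-level agreement yields 0; values like the reported 0.255 vs. 0.286 are interpreted relative to each other, here showing the model rater agrees with the consensus about as well as human experts agree with one another.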