The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional clinical aphasia assessment tools rely on human-specific pragmatic and cognitive mechanisms, making them incompatible with the text-generation characteristics of large language models (LLMs). To address this gap, we propose TAB, the first text-based aphasia benchmark specifically designed for LLMs. TAB is a semantically and syntactically aligned textual reconstruction of the Quick Aphasia Battery, comprising four subtests and a clinically grounded, quantitative scoring framework. We integrate Gemini 2.5 Flash for fully automated scoring; weighted Cohen's kappa shows that its agreement with expert consensus is comparable to inter-rater reliability among human experts (κ = 0.255 vs. expert-to-expert κ = 0.286). TAB is open-sourced to enable large-scale, reproducible evaluation of language impairments in LLMs, thereby establishing the first standardized, automated assessment framework for AI-driven computational aphasiology.

📝 Abstract
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model-consensus agreement vs. 0.286 for human-human agreement). We release TAB as a clinically grounded, scalable framework for analyzing language deficits in artificial systems.
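The kappa statistics above compare chance-corrected agreement between the automated rater and expert consensus against agreement among human experts. As an illustration of how a weighted Cohen's kappa is computed, here is a minimal sketch using linear disagreement weights on hypothetical ordinal scores; the paper's prevalence-weighted variant and its actual scoring scale may differ.

```python
# Sketch: linear-weighted Cohen's kappa for two raters on an ordinal scale.
# The ratings below are hypothetical, not data from the TAB paper.

def weighted_kappa(r1, r2, n_levels):
    """Return 1 - (observed weighted disagreement / expected weighted disagreement)."""
    n = len(r1)
    # Joint distribution of the two raters' scores
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1.0 / n
    # Marginal distributions for each rater
    p1 = [sum(obs[i][j] for j in range(n_levels)) for i in range(n_levels)]
    p2 = [sum(obs[i][j] for i in range(n_levels)) for j in range(n_levels)]
    # Linear disagreement weights: w[i][j] = |i - j| / (n_levels - 1)
    w = [[abs(i - j) / (n_levels - 1) for j in range(n_levels)]
         for i in range(n_levels)]
    d_obs = sum(w[i][j] * obs[i][j]
                for i in range(n_levels) for j in range(n_levels))
    d_exp = sum(w[i][j] * p1[i] * p2[j]
                for i in range(n_levels) for j in range(n_levels))
    return 1.0 - d_obs / d_exp

# Example: model vs. consensus scores on a 0-2 clinical scale (hypothetical)
model     = [2, 1, 2, 0, 1, 2, 1, 0, 2, 1]
consensus = [2, 2, 2, 0, 1, 1, 1, 0, 2, 0]
print(round(weighted_kappa(model, consensus, 3), 3))  # → 0.651
```

Weighting matters for ordinal clinical scores because a rating of 2 vs. 1 is a milder disagreement than 2 vs. 0; unweighted kappa would penalize both equally.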
Problem

Research questions and friction points this paper is trying to address.

Develop benchmark for assessing aphasia-like deficits in language models
Adapt clinical assessments to evaluate linguistic disorders in artificial systems
Create scalable framework for analyzing language deficits computationally
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Aphasia Battery adapted from clinical assessment
Four subtests measure comprehension and repetition deficits
Automated evaluation protocol validated using Gemini model