Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of current large language model (LLM) evaluation: reliance on single aggregate scores obscures fine-grained ability differences, which hinders targeted improvement and task-specific model selection. The study proposes a cognitive diagnostic framework built on multidimensional Item Response Theory and an item-ability association matrix. The framework produces interpretable estimates of fine-grained abilities in mathematics (35 dimensions), physics (27), chemistry (58), and computer science (12), and uses those estimates to predict performance on unseen items. Across 41 models, prediction AUC ranges from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines, with consistent diagnostic performance across the scientific domains.

📝 Abstract
Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting targeted model improvement and ability-guided selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (benchmark questions). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.
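To make the prediction step concrete, here is a minimal sketch of how multidimensional IRT with an item-ability association matrix can turn ability estimates into per-item success probabilities, and how those predictions can be scored with AUC. The abstract does not give the paper's exact parameterization, so the compensatory logistic form below is an assumption. The names `theta`, `a`, `b`, `Q`, `predict_correct`, and `auc` are illustrative, and parameter fitting is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_correct(theta, a, b, Q):
    """Assumed compensatory MIRT: P(model i solves item j).

    theta: (n_models, K) ability levels per dimension
    a:     (n_items, K)  item discriminations
    b:     (n_items,)    item difficulties
    Q:     (n_items, K)  binary item-ability association matrix
    """
    # Q masks out abilities an item does not require.
    logits = theta @ (Q * a).T - b
    return sigmoid(logits)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size))

# Toy data (hypothetical): 2 models, 3 items, 2 ability dimensions.
Q = np.array([[1, 0],   # item 0 requires ability 0 only
              [0, 1],   # item 1 requires ability 1 only
              [1, 1]])  # item 2 requires both
a = np.ones((3, 2))
b = np.zeros(3)
theta = np.array([[2.0, -2.0],   # model 0: strong on ability 0
                  [-2.0, 2.0]])  # model 1: strong on ability 1
probs = predict_correct(theta, a, b, Q)
```

In this toy setup, two models that would look identical on item 2 (where their strengths and weaknesses cancel) are separated by items 0 and 1, which is the kind of distinction a single aggregate score hides.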
Problem

Research questions and friction points this paper is trying to address.

large language models
model evaluation
fine-grained abilities
diagnostic assessment
ability taxonomy
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained ability diagnosis
multidimensional item response theory
cognitive diagnostic framework
ability taxonomy
large language model evaluation
Xu Zhang
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, Hunan, China
Xudong Gong
Unknown affiliation
Jiacheng Qin
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, Hunan, China
Qiang Wang
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
GPU Computing, Energy Efficient Computing, Parallel and Distributed Systems, Spatial Intelligence
JiaQi Liao
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, Hunan, China
Zhe Wang
School of Humanities and Social Sciences, School of Public Administration, Beihang University, Beijing, China
Dawei Feng
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, Hunan, China
Bo Ding
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China; National Key Laboratory of Parallel and Distributed Computing, Changsha, Hunan, China