Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of coarse-grained metrics in evaluating large language models (LLMs) by introducing cognitive diagnostic models from psychometrics to enable fine-grained assessment of reasoning capabilities. The authors propose a scalable diagnostic framework that leverages textual embeddings to construct a Q-matrix prior, allowing stable estimation of high-dimensional proficiency parameters without manual annotation. They further develop a stochastic approximation algorithm to jointly optimize both the LLM’s mastery profile and the Q-matrix, supporting large-scale diagnosis with thousands of items. Experimental results demonstrate that the method accurately recovers ground-truth parameters in simulations and reveals interpretable, fine-grained competence structures of LLMs on the MATH Level 5 benchmark, offering both practical utility and theoretical insight.
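The summary's key idea of a text-embedding-informed Q-matrix prior can be sketched as follows. This is an illustrative assumption, not the paper's actual algorithm: it scores each (item, attribute) pair by the cosine similarity between the item's text embedding and an attribute-description embedding, then squashes the score into a prior probability that the item requires that attribute. The function name `qmatrix_prior` and the logistic squashing are hypothetical choices.

```python
import numpy as np

def qmatrix_prior(item_emb, attr_emb, temperature=0.1):
    """Illustrative sketch of an embedding-informed Q-matrix prior.

    item_emb: (J, d) array of item-text embeddings.
    attr_emb: (K, d) array of attribute-description embeddings.
    Returns a (J, K) matrix of prior probabilities that item j
    requires attribute k.
    """
    # L2-normalize rows so the dot product below is cosine similarity
    item_n = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    attr_n = attr_emb / np.linalg.norm(attr_emb, axis=1, keepdims=True)
    sim = item_n @ attr_n.T  # (J, K) cosine similarities in [-1, 1]
    # Temperature-scaled logistic maps each similarity to (0, 1),
    # interpreted as the prior probability that Q[j, k] = 1
    return 1.0 / (1.0 + np.exp(-sim / temperature))
```

A prior like this can replace manual expert annotation of the Q-matrix, which is what makes the framework scalable to thousands of items.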

📝 Abstract
Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies to adapt cognitive diagnosis models (CDMs) in psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood-based estimation. Second, we incorporate item-level textual information to construct AI-embedding-informed priors for the Q-matrix, stabilizing high-dimensional estimation while reducing reliance on costly human specification. We develop an efficient stochastic-approximation algorithm to jointly estimate LLM mastery profiles and the Q-matrix that balances data fit with text-embedding-informed priors. Simulation studies demonstrate accurate parameter recovery. An application to the MATH Level 5 benchmark illustrates the practical utility of our method for LLM evaluation and uncovers useful insights into LLMs' fine-grained capabilities.
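To make the CDM machinery concrete, here is a minimal sketch of the DINA model, a standard cognitive diagnosis model in psychometrics (the abstract does not specify which CDM the paper uses, so this is an assumed illustration). Given a binary Q-matrix and a binary mastery profile, an item is answered correctly with probability 1 − slip when all required attributes are mastered, and with the guessing probability otherwise.

```python
import numpy as np

def dina_prob(Q, alpha, slip, guess):
    """Illustrative DINA-style response probabilities.

    Q: (J, K) binary item-attribute matrix (1 = attribute required).
    alpha: (K,) binary mastery profile of the examinee (here, an LLM).
    slip, guess: (J,) per-item slip and guessing rates.
    Returns (J,) probabilities of a correct response per item.
    """
    # eta[j] = 1 iff the profile masters every attribute item j requires
    eta = np.all((Q == 0) | (alpha[None, :] == 1), axis=1).astype(float)
    # Correct with prob (1 - slip) when eta = 1, else with prob guess
    return np.where(eta == 1, 1.0 - slip, guess)
```

In a diagnosis setting, the latent profile `alpha` (and, in this paper, the Q-matrix itself) would be estimated from observed right/wrong responses rather than given, which is where the stochastic-approximation algorithm comes in.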
Problem

Research questions and friction points this paper is trying to address.

cognitive diagnosis
large language models
fine-grained evaluation
Q-matrix estimation
text embedding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cognitive diagnosis models
large language models
text embeddings
Q-matrix estimation
scalable evaluation
Jia Liu
Department of Statistics, Columbia University
Zhiyu Xu
Department of Statistics, Columbia University
Yuqi Gu
Columbia University
Statistics · Psychometrics · Statistical Machine Learning