Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of coarse-grained metrics in evaluating large language models (LLMs) by introducing cognitive diagnostic models from psychometrics to enable fine-grained assessment of reasoning capabilities. The authors propose a scalable diagnostic framework that leverages textual embeddings to construct a Q-matrix prior, allowing stable estimation of high-dimensional proficiency parameters without manual annotation. They further develop a stochastic approximation algorithm to jointly optimize both the LLM’s mastery profile and the Q-matrix, supporting large-scale diagnosis with thousands of items. Experimental results demonstrate that the method accurately recovers ground-truth parameters in simulations and reveals interpretable, fine-grained competence structures of LLMs on the MATH Level 5 benchmark, offering both practical utility and theoretical insight.
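The summary's key idea of a text-embedding-informed Q-matrix prior can be sketched as follows. This is an illustrative assumption, not the paper's actual algorithm: it scores each (item, attribute) pair by the cosine similarity between the item's text embedding and an attribute-description embedding, then squashes the score into a prior probability that the item requires that attribute. The function name `qmatrix_prior` and the logistic squashing are hypothetical choices.

```python
import numpy as np

def qmatrix_prior(item_emb, attr_emb, temperature=0.1):
    """Illustrative sketch of an embedding-informed Q-matrix prior.

    item_emb: (J, d) array of item-text embeddings.
    attr_emb: (K, d) array of attribute-description embeddings.
    Returns a (J, K) matrix of prior probabilities that item j
    requires attribute k.
    """
    # L2-normalize rows so the dot product below is cosine similarity
    item_n = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    attr_n = attr_emb / np.linalg.norm(attr_emb, axis=1, keepdims=True)
    sim = item_n @ attr_n.T  # (J, K) cosine similarities in [-1, 1]
    # Temperature-scaled logistic maps each similarity to (0, 1),
    # interpreted as the prior probability that Q[j, k] = 1
    return 1.0 / (1.0 + np.exp(-sim / temperature))
```

A prior like this can replace manual expert annotation of the Q-matrix, which is what makes the framework scalable to thousands of items.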

📝 Abstract
Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies to adapt cognitive diagnosis models (CDMs) in psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood-based estimation. Second, we incorporate item-level textual information to construct AI-embedding-informed priors for the Q-matrix, stabilizing high-dimensional estimation while reducing reliance on costly human specification. We develop an efficient stochastic-approximation algorithm to jointly estimate LLM mastery profiles and the Q-matrix that balances data fit with text-embedding-informed priors. Simulation studies demonstrate accurate parameter recovery. An application to the MATH Level 5 benchmark illustrates the practical utility of our method for LLM evaluation and uncovers useful insights into LLMs' fine-grained capabilities.
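To make the CDM machinery concrete, here is a minimal sketch of the DINA model, a standard cognitive diagnosis model in psychometrics (the abstract does not specify which CDM the paper uses, so this is an assumed illustration). Given a binary Q-matrix and a binary mastery profile, an item is answered correctly with probability 1 − slip when all required attributes are mastered, and with the guessing probability otherwise.

```python
import numpy as np

def dina_prob(Q, alpha, slip, guess):
    """Illustrative DINA-style response probabilities.

    Q: (J, K) binary item-attribute matrix (1 = attribute required).
    alpha: (K,) binary mastery profile of the examinee (here, an LLM).
    slip, guess: (J,) per-item slip and guessing rates.
    Returns (J,) probabilities of a correct response per item.
    """
    # eta[j] = 1 iff the profile masters every attribute item j requires
    eta = np.all((Q == 0) | (alpha[None, :] == 1), axis=1).astype(float)
    # Correct with prob (1 - slip) when eta = 1, else with prob guess
    return np.where(eta == 1, 1.0 - slip, guess)
```

In a diagnosis setting, the latent profile `alpha` (and, in this paper, the Q-matrix itself) would be estimated from observed right/wrong responses rather than given, which is where the stochastic-approximation algorithm comes in.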
Problem

Research questions and friction points this paper is trying to address.

cognitive diagnosis
large language models
fine-grained evaluation
Q-matrix estimation
text embedding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cognitive diagnosis models
large language models
text embeddings
Q-matrix estimation
scalable evaluation
Jia Liu
Department of Statistics, Columbia University
Zhiyu Xu
Department of Statistics, Columbia University
Yuqi Gu
Columbia University
Statistics · Psychometrics · Statistical Machine Learning