Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluations predominantly rely on accuracy metrics, neglecting the efficiency of the reasoning process. This work proposes the Latency-Response Theory (LaRT), the first framework to integrate chain-of-thought (CoT) length—as a proxy for inference latency—into item response theory (IRT), jointly modeling model capability and reasoning efficiency while introducing a capability–speed correlation parameter. We establish theoretical identifiability of all parameters and develop a stochastic approximation EM algorithm for efficient estimation, supported by asymptotic analysis and simulation studies. Empirical evaluation on real-world benchmarks demonstrates that LaRT significantly outperforms conventional IRT: it achieves higher prediction accuracy, more reliable model ranking, and narrower confidence intervals. LaRT thus establishes a novel evaluation paradigm for LLMs that is both accurate and sensitive to reasoning dynamics.
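To make the model structure concrete, here is a hypothetical sketch of a LaRT-style data-generating process: a 2PL IRT model for response accuracy plus a log-normal model for CoT length, with latent ability and latent speed drawn with a correlation parameter. All variable names, the log-normal choice, and the value of the correlation are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent traits: each LLM j has ability theta_j and speed tau_j, drawn with
# correlation rho -- standing in for the capability-speed correlation parameter.
J, I, rho = 200, 30, -0.5            # num. models, num. items, assumed correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
theta, tau = rng.multivariate_normal([0.0, 0.0], cov, size=J).T

# Item parameters: discrimination a, difficulty b (2PL IRT for correctness),
# time intensity beta and precision alpha for log CoT length (all assumed forms).
a = rng.uniform(0.5, 2.0, I)
b = rng.normal(0.0, 1.0, I)
beta = rng.normal(5.0, 0.5, I)       # mean log CoT length per item
alpha = rng.uniform(1.0, 2.0, I)

# Response accuracy: Bernoulli with 2PL success probability.
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
y = rng.binomial(1, p)               # shape (J, I)

# CoT length: log-normal, shorter for higher-speed (larger tau) models.
log_len = rng.normal(beta - tau[:, None], 1.0 / alpha)
cot_len = np.exp(log_len)            # shape (J, I)

print(y.shape, cot_len.shape)
```

Jointly observing `y` and `cot_len` is what lets the correlation between ability and speed be estimated, which a standard IRT model fit to `y` alone cannot do.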

📝 Abstract
The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to provide guidance for both downstream applications and actionable future improvements. The Item Response Theory (IRT) model with Computerized Adaptive Testing has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose the Latency-Response Theory (LaRT) model, which jointly models both the response accuracy and CoT length by introducing a key correlation parameter between the latent ability and the latent speed. We derive an efficient stochastic approximation Expectation-Maximization algorithm for parameter estimation. We establish rigorous identifiability results for the latent ability and latent speed parameters to ensure the statistical validity of their estimation. Through both theoretical asymptotic analyses and simulation studies, we demonstrate LaRT's advantages over IRT in terms of superior estimation accuracy and shorter confidence intervals for latent trait estimation. To evaluate LaRT in real data, we collect responses from diverse LLMs on popular benchmark datasets. We find that LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs using response accuracy and reasoning chain length
Introduces LaRT model to jointly model accuracy and CoT length
Improves estimation accuracy and efficiency over traditional IRT methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

LaRT jointly models response accuracy and CoT length
Uses stochastic approximation EM algorithm for parameter estimation
Establishes identifiability for latent ability and speed parameters
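The stochastic approximation EM (SAEM) family the paper draws on alternates sampling latent variables from their current posterior with a Robbins-Monro averaging of sufficient statistics. The toy below illustrates that skeleton on a deliberately simple Gaussian latent-variable model; it is a sketch of the algorithm family only, not the paper's actual LaRT estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (an assumption for illustration): latent z_i ~ N(mu, 1),
# observed y_i = z_i + eps_i with eps_i ~ N(0, 1). The goal is to estimate mu.
n, mu_true = 500, 1.5
z_true = rng.normal(mu_true, 1.0, n)
y = z_true + rng.normal(0.0, 1.0, n)

mu, s = 0.0, 0.0
for k in range(1, 501):
    # Simulation step: draw z from its exact posterior N((mu + y)/2, 1/2).
    z = rng.normal((mu + y) / 2.0, np.sqrt(0.5))
    # Stochastic approximation of the sufficient statistic (mean of z),
    # with a common SAEM schedule: burn-in at gamma=1, then decreasing steps.
    gamma = 1.0 if k <= 50 else 1.0 / (k - 50)
    s = (1.0 - gamma) * s + gamma * z.mean()
    # M-step: for this model the updated mu is the statistic itself.
    mu = s

print(mu)  # converges near the fixed point y.mean()
```

The fixed point satisfies mu = (mu + mean(y))/2, i.e. mu = mean(y), so the recursion recovers the MLE without ever computing the E-step expectation in closed form; LaRT applies the same idea to the much richer accuracy-plus-CoT-length likelihood.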
Zhiyu Xu
Department of Statistics, Columbia University
Jia Liu
Department of Statistics, Columbia University
Yixin Wang
Department of Statistics, University of Michigan
Yuqi Gu
Columbia University
Statistics · Psychometrics · Statistical Machine Learning