🤖 AI Summary
Existing LLM evaluations predominantly rely on accuracy metrics, neglecting the efficiency of the reasoning process. This work proposes the Latency-Response Theory (LaRT) model, the first framework to integrate chain-of-thought (CoT) length, as a proxy for inference latency, into item response theory (IRT), jointly modeling a model's capability and reasoning efficiency via a capability–speed correlation parameter. The authors establish identifiability of all parameters and develop a stochastic approximation EM algorithm for efficient estimation, supported by asymptotic analysis and simulation studies. Empirical evaluation on real-world benchmarks demonstrates that LaRT significantly outperforms conventional IRT: it achieves higher prediction accuracy, more reliable model rankings, and narrower confidence intervals. LaRT thus establishes an evaluation paradigm for LLMs that is both accurate and sensitive to reasoning dynamics.
📝 Abstract
The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods that can guide both downstream applications and actionable future improvements. The Item Response Theory (IRT) model with Computerized Adaptive Testing has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain-of-thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage CoT length information in the evaluation of LLMs, we propose the Latency-Response Theory (LaRT) model, which jointly models both response accuracy and CoT length by introducing a key correlation parameter between the latent ability and the latent speed. We derive an efficient stochastic approximation Expectation-Maximization algorithm for parameter estimation. We establish rigorous identifiability results for the latent ability and latent speed parameters to ensure the statistical validity of their estimation. Through both theoretical asymptotic analyses and simulation studies, we demonstrate LaRT's advantages over IRT in terms of superior estimation accuracy and shorter confidence intervals for latent trait estimation. To evaluate LaRT on real data, we collect responses from diverse LLMs on popular benchmark datasets. We find that LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics, including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.
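The joint accuracy–speed structure described in the abstract can be illustrated with a small simulation. The sketch below follows a generic hierarchical IRT-plus-response-time formulation (a 2PL accuracy model together with a lognormal model for CoT length, with latent ability and latent speed correlated through a single parameter `rho`); the exact LaRT likelihood, parameterization, and all variable names here are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 500, 30
rho = 0.6  # assumed ability-speed correlation: the key joint-modeling parameter

# Latent ability (theta) and latent speed (tau) per LLM, correlated via rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
theta, tau = rng.multivariate_normal([0.0, 0.0], cov, size=n_models).T

# Item parameters (illustrative distributions, not taken from the paper).
a = rng.uniform(0.5, 2.0, n_items)      # discrimination
b = rng.normal(0.0, 1.0, n_items)       # difficulty
beta = rng.normal(4.0, 0.5, n_items)    # time intensity (baseline log CoT length)
alpha = rng.uniform(1.0, 2.0, n_items)  # time precision

# Accuracy channel: 2PL IRT, P(correct) = sigmoid(a_j * (theta_i - b_j)).
logits = a * (theta[:, None] - b)
p_correct = 1.0 / (1.0 + np.exp(-logits))
y = rng.random((n_models, n_items)) < p_correct

# Latency channel: lognormal CoT length; higher latent speed -> shorter CoT.
log_len = beta - tau[:, None] + rng.normal(0.0, 1.0 / alpha, (n_models, n_items))
cot_len = np.exp(log_len)
```

Because `rho > 0` here, more capable simulated models also tend to be faster, which is exactly the kind of dependence a marginal accuracy-only IRT fit discards and a joint model can exploit for sharper latent-trait estimates.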