🤖 AI Summary
This work addresses a critical limitation of existing calibration methods for large language models (LLMs): they focus on the correctness of individual responses and thus often fail to reflect a model's overall task-solving capability, misaligning confidence with actual performance. To bridge this gap, the authors propose a new paradigm, capability calibration, which estimates an LLM's expected accuracy on a given query, shifting the focus from response-level to task-level reliability assessment. They formally distinguish capability calibration from traditional response calibration, develop a theoretical framework grounded in the stochasticity of LLM decoding, and systematically evaluate a range of confidence estimation methods under the new paradigm. Experiments demonstrate that capability calibration substantially improves the accuracy of pass@$k$ prediction and the efficiency of reasoning-resource allocation, offering a more reliable foundation for downstream applications.
📝 Abstract
Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation for a range of downstream applications.
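The abstract does not specify how capability confidence connects to pass@$k$; the sketch below illustrates the standard relation under an independence assumption (each sampled response is correct independently with probability $p$, the query-level expected accuracy), alongside the widely used unbiased pass@$k$ estimator of Chen et al. (2021). Function names are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k_from_capability(p: float, k: int) -> float:
    """Predicted pass@k if each of k sampled responses is correct
    independently with probability p (the capability confidence)."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased empirical pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), given n sampled responses with c correct."""
    if n - c < k:  # too few incorrect samples to fill a failing k-subset
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A modest capability confidence of 0.3 already implies pass@5 ≈ 0.83
# under the independence assumption.
print(round(pass_at_k_from_capability(0.3, 5), 2))
```

This makes the abstract's point concrete: a well-calibrated query-level confidence $p$ directly yields a pass@$k$ prediction, whereas a single response-level correctness score does not.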