Efficient Test-Time Scaling via Self-Calibration

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficient static allocation of computational resources at large language model (LLM) inference time, this paper proposes a dynamic test-time scaling method based on self-calibrated confidence estimation. The core innovation is the Self-Calibration mechanism: it distills external confidence—derived from Self-Consistency sampling—into the model’s internal representations, enabling reliable confidence estimation via a single forward pass. This calibrated confidence then drives two adaptive inference strategies, Early-Stopping Best-of-N and calibrated Self-Consistency, terminating inference early on simple queries while intensifying reasoning on complex ones. Evaluated on MathQA, the method improves accuracy by 2.6 percentage points (81.0% → 83.6%) under a 16-sample budget. Across three LLMs and six benchmarks, it reduces redundant computation, improves response quality, and lowers energy cost.
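The Early-Stopping Best-of-N strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `confidence` are hypothetical callables standing in for one sampling pass and for the self-calibrated confidence head (which the paper obtains in the same forward pass), and the threshold value is illustrative.

```python
def early_stopping_best_of_n(generate, confidence, query,
                             max_samples=16, threshold=0.9):
    """Sketch of confidence-based Early-Stopping Best-of-N.

    `generate(query)` draws one response; `confidence(query, response)`
    returns a calibrated confidence in [0, 1]. Both names are
    illustrative placeholders, not the paper's API.
    """
    best_response, best_conf = None, -1.0
    for _ in range(max_samples):
        response = generate(query)
        conf = confidence(query, response)
        if conf > best_conf:
            best_response, best_conf = response, conf
        if conf >= threshold:
            # Confident enough: stop sampling for this easy query.
            break
    return best_response, best_conf
```

Easy queries thus consume one or two samples, while hard queries exhaust the full budget and fall back to the highest-confidence response, which is how the method recovers the 81.0 → 83.6 MathQA gain under the same 16-sample budget.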

📝 Abstract
Increasing test-time computation is a straightforward approach to enhancing the quality of responses in Large Language Models (LLMs). While Best-of-N sampling and Self-Consistency with majority voting are simple and effective, they require a fixed number of sampling responses for each query, regardless of its complexity. This could result in wasted computation for simpler questions and insufficient exploration for more challenging ones. In this work, we argue that model confidence of responses can be used for improving the efficiency of test-time scaling. Unfortunately, LLMs are known to be overconfident and provide unreliable confidence estimation. To address this limitation, we introduce Self-Calibration by distilling Self-Consistency-derived confidence into the model itself. This enables reliable confidence estimation at test time with one forward pass. We then design confidence-based efficient test-time scaling methods to handle queries of various difficulty, such as Early-Stopping for Best-of-N and Self-Consistency with calibrated confidence. Experiments on three LLMs across six datasets demonstrate the effectiveness of our approach. Specifically, applying confidence-based Early Stopping to Best-of-N improves MathQA accuracy from 81.0 to 83.6 with a sample budget of 16 responses, indicating the efficacy of confidence-based sampling strategy at inference time.
Problem

Research questions and friction points this paper is trying to address.

Fixed-budget test-time scaling (Best-of-N, Self-Consistency) allocates the same number of samples to every query, wasting computation on easy questions and under-exploring hard ones.
LLMs are overconfident, so their raw confidence estimates are too unreliable to guide adaptive sampling.
Reliable confidence estimation must itself be cheap (a single forward pass), or it erodes the efficiency gains it is meant to enable.
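The confidence signal that Self-Calibration distills into the model is derived from Self-Consistency sampling: roughly, how often each answer appears among N sampled responses. A minimal sketch of that training target, under the assumption that agreement rate is used directly as the soft confidence label (the function name is illustrative):

```python
from collections import Counter

def self_consistency_confidence(sampled_answers):
    """Sketch of a Self-Consistency-derived confidence target:
    each distinct answer's confidence is its agreement rate among
    the N sampled responses. Illustrative, not the paper's exact recipe."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return {answer: count / n for answer, count in counts.items()}
```

Distilling these targets into the model is what lets it later emit a calibrated confidence from a single forward pass, without re-running the expensive N-sample vote at test time.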
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Calibration for reliable confidence estimation
Confidence-based Early Stopping for Best-of-N
Self-Consistency with calibrated confidence
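The third contribution, Self-Consistency with calibrated confidence, replaces equal-weight majority voting with confidence-weighted voting. A minimal sketch, assuming each sampled answer arrives paired with its calibrated confidence score (the function name and input shape are illustrative):

```python
from collections import defaultdict

def calibrated_self_consistency(answers_with_conf):
    """Sketch of calibrated Self-Consistency: each sample's vote is
    weighted by its calibrated confidence rather than counted equally.
    Illustrative, not the paper's exact aggregation rule."""
    scores = defaultdict(float)
    for answer, conf in answers_with_conf:
        scores[answer] += conf
    return max(scores, key=scores.get)
```

Unlike plain majority voting, this lets a single high-confidence answer outvote several low-confidence agreeing ones, e.g. one sample at 0.9 beats two samples at 0.4 each.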