Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning

📅 2025-07-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of test-time scaling methods in mathematical reasoning that rely on external verifiers or reward signals. The authors propose a verifier-free attempt-selection mechanism grounded solely in the language model's intrinsic prefix confidence. The core idea is prefix-confidence scaling: using only the first 32 tokens of each sampled attempt, the model's own confidence estimates each attempt's potential, and only the most promising attempt is continued to a full solution. This approach requires no external supervision and appears less susceptible to length bias than Best-of-N (BoN) sampling. Evaluated on five benchmarks (GSM8K, MATH500, AMC23, AIME24, and AIME25), the method achieves a better accuracy-compute trade-off than mainstream verifier-free baselines such as majority voting. Test-time training with prefix confidence outperforms the base model but does not improve over prefix-confidence scaling.

📝 Abstract
Recent work has shown that language models can self-improve by maximizing their own confidence in their predictions, without relying on external verifiers or reward signals. In this work, we study the test-time scaling of language models for mathematical reasoning tasks, where the model's own confidence is used to select the most promising attempts. Surprisingly, we find that we can achieve significant performance gains by continuing only the most promising attempt, selected by the model's prefix-confidence. We systematically evaluate prefix-confidence scaling on five mathematical reasoning datasets: the school-level GSM8K and MATH500, and the competition-level AMC23, AIME24, and AIME25. We find that prefix-confidence scaling with prefixes of only 32 tokens achieves a better accuracy-compute trade-off than majority voting. Moreover, prefix-confidence scaling appears less susceptible than BoN to length biases. Finally, we also evaluate test-time training with prefix-confidence and find that, while outperforming the base model, it does not improve over prefix-confidence scaling.
Problem

Research questions and friction points this paper is trying to address.

How to scale test-time compute for mathematical reasoning without external verifiers or reward signals
How to select the most promising attempt at a favorable accuracy-compute trade-off
Whether confidence-based selection can outperform majority voting and avoid BoN's length bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selects attempts by the model's own prefix confidence, with no external verifier
Scores prefixes of only 32 tokens, keeping selection cheap
Achieves a better accuracy-compute trade-off than majority voting
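The selection mechanism above can be sketched in a few lines. This is a minimal illustration, assuming prefix confidence is computed as the mean per-token log-probability the model assigns to the first 32 tokens of each sampled attempt; the paper's exact scoring rule and interface may differ, and the log-probabilities here are made-up toy values rather than real model outputs.

```python
# Hypothetical sketch of prefix-confidence selection.
# Each candidate attempt is represented by the list of per-token
# log-probabilities the model assigned to its own sampled tokens.

PREFIX_LEN = 32  # only the first 32 tokens of each attempt are scored


def prefix_confidence(logprobs, prefix_len=PREFIX_LEN):
    """Mean per-token log-probability over the prefix (higher = more confident)."""
    prefix = logprobs[:prefix_len]
    return sum(prefix) / len(prefix)


def select_best_attempt(candidates):
    """Return the index of the most confident prefix; only that attempt
    would be continued to a full solution, saving compute on the rest."""
    scores = [prefix_confidence(lp) for lp in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy example: three sampled 32-token prefixes with invented log-probs.
    candidates = [
        [-2.0] * 32,  # low confidence
        [-0.5] * 32,  # highest confidence, so this attempt is selected
        [-1.2] * 32,
    ]
    print(select_best_attempt(candidates))  # -> 1
```

Because only short prefixes are scored, the per-attempt selection cost is fixed regardless of how long the final solutions are, which is what makes the accuracy-compute trade-off favorable compared with generating and voting over full solutions.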