Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

📅 2025-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Training with standard cross-entropy (CE) loss is misaligned with the pass@N evaluation objective used in mathematical reasoning and code generation: CE encourages overconfident predictions, which suppresses output diversity and degrades pass@N accuracy. Method: We propose a confidence-constrained training loss that explicitly regularizes how concentrated the model’s output distribution is, improving sampling efficiency and search robustness at inference, particularly when test-time compute is scaled up (e.g., multi-path proof-tree search). Contribution/Results: Our method achieves significant improvements in pass@k (k = 1, 8, 64) on the MATH and MiniF2F benchmarks, especially for complex theorem proving and for discovering diverse solution paths. It enables co-optimization of the training objective and inference-time compute scaling, establishing a computation-aware paradigm for LLM training.

📝 Abstract

Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in $N$ independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N, in that pass@N accuracy decreases with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore, we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall, our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
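To make the pass@N argument concrete, here is a minimal sketch in Python. It assumes each problem has a fixed probability p that a single sample is correct, so the chance that at least one of N independent samples is correct is 1 - (1 - p)^N. The function name `pass_at_n` and the two toy probability profiles are illustrative assumptions, not values from the paper.

```python
def pass_at_n(success_probs, n):
    """Average probability that at least one of n independent samples is correct.

    success_probs: per-problem probability that a single sample is correct
    (hypothetical numbers, chosen only for illustration).
    """
    return sum(1.0 - (1.0 - p) ** n for p in success_probs) / len(success_probs)


# Toy "overconfident" model: certain on every problem, but wrong on half of them.
overconfident = [1.0, 1.0, 0.0, 0.0]
# Toy "less confident" model: spreads probability, so every problem stays reachable.
less_confident = [0.4, 0.4, 0.4, 0.4]

for n in (1, 4, 16):
    print(n, pass_at_n(overconfident, n), pass_at_n(less_confident, n))
```

In this toy setting the overconfident profile wins at N = 1 (0.5 versus 0.4) but stays at 0.5 for every N, because extra samples cannot recover a problem whose correct answer has zero probability. The less confident profile exceeds 0.99 by N = 16. This is the sense in which overconfidence can impede scaling test-time compute.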

Problem

Research questions and friction points this paper is trying to address.

Optimizing model training for test-time compute

Addressing overconfidence in cross-entropy loss

Improving mathematical reasoning via modified training loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Limits model confidence

Modifies training loss

Aligns with pass@N
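One way to picture a loss that is aligned with pass@N and limits confidence is to minimize the negative log of the pass@N success probability, -log(1 - (1 - p)^N), where p is the model's probability of the correct answer. The sketch below is an assumption for illustration and is not taken from the text above; the paper's exact formulation may differ, and the names `pass_at_n_loss` and `gradient_weight` are hypothetical.

```python
import math


def pass_at_n_loss(p: float, n: int) -> float:
    """Negative log-probability that at least one of n samples is correct.

    Illustrative assumption, not necessarily the paper's loss.
    For n = 1 this reduces to the usual cross-entropy term -log(p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(1.0 - (1.0 - p) ** n)


def gradient_weight(p: float, n: int) -> float:
    """Ratio of this loss's gradient to the cross-entropy gradient (both w.r.t. log p).

    Derived by differentiating -log(1 - (1 - p)^n) with respect to log p:
    n * p * (1 - p)^(n - 1) / (1 - (1 - p)^n).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return n * p * (1.0 - p) ** (n - 1) / (1.0 - (1.0 - p) ** n)


for p in (0.01, 0.5, 0.99):
    print(p, gradient_weight(p, 1), gradient_weight(p, 16))
```

Under this assumed loss, the weight equals 1 when N = 1, so training matches cross-entropy. For N > 1 the weight shrinks toward 0 as p approaches 1, so examples the model already answers confidently stop pushing it toward further confidence, while low-probability examples keep a weight close to 1.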
