Understanding the Role of Training Data in Test-Time Scaling

📅 2025-10-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how training data characteristics influence large language models' ability to generate long chain-of-thought (CoT) reasoning and to improve inference performance under test-time scaling. Methodologically, it combines theoretical analysis of an in-context weight-prediction task for linear regression with large-scale experiments on nonlinear Transformer architectures. The study identifies training data diversity, feature-task relevance, and task difficulty, quantified by the smallest eigenvalue of the feature covariance matrix, as the decisive factors governing test-time scaling efficacy. Key contributions include: (i) a demonstration that increasing test-time computation degrades performance when the skills required by a downstream task are insufficiently represented in the training data; (ii) an explicit trade-off between test-time compute budget and the training context length needed to reach a fixed test error; and (iii) the finding that long CoT improves generalization only under specific data conditions. The results yield actionable, data-centric design principles for efficient inference and a unified explanation for diverse empirical phenomena in test-time scaling.

📝 Abstract
Test-time scaling improves the reasoning capabilities of large language models (LLMs) by allocating extra compute to generate longer Chains-of-Thought (CoTs). This enables models to tackle more complex problems by breaking them down into additional steps, backtracking, and correcting mistakes. Despite its strong performance, demonstrated by OpenAI's o1 and DeepSeek R1, the conditions in the training data under which long CoTs emerge, and when such long CoTs improve performance, remain unclear. In this paper, we study the performance of test-time scaling for transformers trained on an in-context weight-prediction task for linear regression. Our analysis provides a theoretical explanation for several intriguing observations: First, at any fixed test error, increasing test-time compute allows us to reduce the number of in-context examples (context length) in training prompts. Second, if the skills required to solve a downstream task are not sufficiently present in the training data, increasing test-time compute can harm performance. Finally, we characterize task hardness via the smallest eigenvalue of its feature covariance matrix and show that training on a diverse, relevant, and hard set of tasks results in the best performance for test-time scaling. We confirm our findings with experiments on large, nonlinear transformer architectures.
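The in-context weight-prediction task and the covariance-based hardness measure from the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact construction: the Gaussian weights, noise-free labels, and least-squares readout (standing in for the transformer's in-context prediction) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n_examples: int, dim: int, feature_cov: np.ndarray):
    """Sample one in-context linear-regression task: a hidden weight
    vector w and n_examples (x, y) pairs with y = w . x (noise-free)."""
    w = rng.standard_normal(dim)  # hidden task weights to be predicted
    x = rng.multivariate_normal(np.zeros(dim), feature_cov, size=n_examples)
    y = x @ w
    return x, y, w

# Task hardness is characterized by the smallest eigenvalue of the
# feature covariance matrix: a tiny eigenvalue means one direction of
# w is barely observed in the context examples.
dim = 4
cov = np.diag([1.0, 1.0, 1.0, 0.05])  # one poorly-covered direction
x, y, w = make_task(n_examples=32, dim=dim, feature_cov=cov)

# Least-squares readout of the weights from the in-context examples.
w_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
hardness = np.linalg.eigvalsh(cov).min()  # smallest eigenvalue: 0.05
```

With noise-free labels and more examples than dimensions, the least-squares readout recovers `w` exactly; the interesting regimes studied in the paper arise when context length, noise, and covariance spectrum interact.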
Problem

Research questions and friction points this paper is trying to address.

Investigating training data conditions enabling effective long reasoning chains
Analyzing when extended computation improves versus harms model performance
Characterizing task difficulty through feature covariance for optimal scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling extends reasoning via longer Chains-of-Thought
Training data diversity determines test-time compute effectiveness
Task hardness characterized via feature covariance eigenvalues
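The diversity claim above has a simple linear-regression intuition, sketched here under assumed details (noisy labels, least-squares estimation): when one direction of feature space is barely covered, weight recovery degrades, so more compute spent refining the estimate cannot compensate for missing coverage.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimation_error(feature_cov, n=64, dim=4, noise=0.1, trials=200):
    """Mean weight-recovery error for noisy in-context linear regression
    under a given feature covariance (illustrative, not the paper's setup)."""
    errs = []
    for _ in range(trials):
        w = rng.standard_normal(dim)
        x = rng.multivariate_normal(np.zeros(dim), feature_cov, size=n)
        y = x @ w + noise * rng.standard_normal(n)
        w_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
        errs.append(np.linalg.norm(w_hat - w))
    return float(np.mean(errs))

diverse = np.eye(4)                      # all directions well covered
skewed = np.diag([1.0, 1.0, 1.0, 0.01])  # one direction barely observed
err_diverse = estimation_error(diverse)
err_skewed = estimation_error(skewed)
```

Here `err_diverse` comes out well below `err_skewed`, mirroring the finding that diverse (well-conditioned) training features support better scaling behavior than skewed ones.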