Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the relationship between chain-of-thought (CoT) length and performance in large language models (LLMs) on complex reasoning tasks, revealing that excessive CoT expansion can harm mathematical reasoning accuracy and that the optimal CoT length distribution differs across domains. It is the first to identify a pronounced performance inflection point in CoT scaling. To address this, the authors propose Thinking-Optimal Scaling, a self-improving framework that teaches a model to adaptively allocate inference effort starting from a small set of seed data. It comprises three core components: distillation from seed data with varying response-length distributions (built on Qwen2.5-32B-Instruct as the base model), multi-granularity modeling of reasoning effort, and self-selection of the shortest correct response without external supervision. Evaluated on multiple mathematical reasoning benchmarks, the method outperforms comparable 32B distilled models and matches the performance of QwQ-32B-Preview, empirically supporting "optimal, not maximal" allocation of test-time compute.

📝 Abstract
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current research continues to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model's reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with QwQ-32B-Preview.
Problem

Research questions and friction points this paper is trying to address.

Optimal CoT length for LLM reasoning
Adverse effects of excessive CoT scaling
Domain-specific scaled length distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal CoT length scaling
Self-improvement with shortest responses
Domain-specific reasoning effort adaptation
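The self-improvement step described above, picking the shortest correct response among candidates sampled under different reasoning efforts, can be sketched as follows. The `Response` type, the exact-match answer check, and the toy candidates are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Response:
    text: str    # full chain-of-thought response
    answer: str  # final extracted answer

def select_shortest_correct(responses, reference_answer):
    """Among candidates sampled under different reasoning efforts,
    keep only the correct ones and return the shortest.
    Returns None if no candidate is correct."""
    correct = [r for r in responses if r.answer == reference_answer]
    if not correct:
        return None
    return min(correct, key=lambda r: len(r.text))

# Toy candidates standing in for samples at low/medium/high effort.
candidates = [
    Response(text="Short but wrong reasoning.", answer="41"),
    Response(text="Medium chain of thought arriving at 42.", answer="42"),
    Response(text="A very long, exhaustive chain of thought that also "
                  "reaches 42 after many redundant steps.", answer="42"),
]

best = select_shortest_correct(candidates, reference_answer="42")
```

Responses selected this way would then serve as training targets for the next round of self-improvement, biasing the model toward the least reasoning effort that still yields a correct answer.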