🤖 AI Summary
Large language models (LLMs) often blindly rely on excessively long chain-of-thought (CoT) reasoning, leading to unnecessary computation for simple problems and insufficient resource allocation for complex ones. Method: This paper proposes a utility-maximization optimization framework constrained by learnable inference budgets. It integrates supervised fine-tuning (LLaMA3.1-8B Instruct), inference-budget-constrained policy optimization (IBPO), and utility-driven reinforcement learning—departing from conventional single-modality long-CoT paradigms. Contribution/Results: Crucially, it introduces the first learnable inference budget constraint enabling difficulty-aware, adaptive CoT length control. On MATH500, it achieves absolute accuracy gains of +4.14% and +5.74% (relative improvements of +8.08% and +11.2%) over baselines under 2.16× and 4.32× inference budgets, respectively—yielding approximately double the performance of self-consistency at equivalent budget levels.
📝 Abstract
Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, such as through self-correction and extensive long chain-of-thoughts. While promising in problem-solving, advanced long reasoning chain models exhibit an undesired single-modal behavior, where trivial questions require unnecessarily tedious long chains of thought. In this work, we propose a way to allow models to be aware of inference budgets by formulating it as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to ``understand'' the difficulty of queries and allocate inference budgets to harder ones. With different inference budgets, our best models are able to have a $4.14$% and $5.74$% absolute improvement ($8.08$% and $11.2$% relative improvement) on MATH500 using $2.16$x and $4.32$x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately $2$x those of self-consistency under the same budgets.