🤖 AI Summary
Large language models (LLMs) suffer from high inference costs and redundant reasoning steps, limiting their efficiency and scalability.
Method: This paper proposes a rational reasoning control framework grounded in cognitive science–inspired meta-reasoning. Its core innovation is the first integration of “value of computation” (VoC) modeling into LLM inference control, implemented via a value-aware reward function, expert-iterated reinforcement training, and a dynamic reasoning-path decision mechanism that enables the model to autonomously determine whether to generate intermediate reasoning steps.
Contribution/Results: Evaluated on three mainstream LLMs, the method reduces inference tokens by 20–37% while matching the accuracy of few-shot chain-of-thought and STaR baselines, achieving a Pareto-optimal trade-off between inference cost and task performance.
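The value-of-computation idea can be sketched as a reward that nets task success against the cost of the reasoning tokens spent to obtain it. The exact functional form and the `token_cost` coefficient below are illustrative assumptions, not the paper's published formulation:

```python
def voc_reward(correct: bool, n_reasoning_tokens: int, token_cost: float = 0.001) -> float:
    """Value-of-Computation-style reward (illustrative sketch).

    Task reward (1.0 for a correct answer, 0.0 otherwise) minus a
    per-token penalty on the intermediate reasoning generated, so
    unnecessary reasoning steps reduce the reward.
    """
    task_reward = 1.0 if correct else 0.0
    return task_reward - token_cost * n_reasoning_tokens
```

Under this reward, a correct answer reached with fewer reasoning tokens scores higher than the same answer reached with a longer chain of thought, while reasoning that flips a wrong answer to a correct one is still worth its cost.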
📝 Abstract
Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs), deploying additional inference-time compute to improve task performance. However, as LLMs increase in both size and adoption, inference costs are correspondingly becoming increasingly burdensome. How, then, might we optimize reasoning's cost-performance tradeoff? This work introduces a novel approach based on computational models of metareasoning used in cognitive science, training LLMs to selectively use intermediate reasoning steps only when necessary. We first develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning, then use this reward function with Expert Iteration to train the LLM. Compared to few-shot chain-of-thought prompting and STaR, our method significantly reduces inference costs (20-37% fewer tokens generated across three models) while maintaining task performance across diverse datasets.
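The Expert Iteration step described above can be sketched as a data-selection loop: sample candidate responses (with and without intermediate reasoning), score each with the value-aware reward, and keep the highest-reward trace per question for fine-tuning. The function below is a minimal sketch under those assumptions; the candidate format and `token_cost` are hypothetical, not the paper's actual pipeline:

```python
def select_training_examples(candidates, token_cost=0.001):
    """Expert-iteration-style selection (illustrative sketch).

    `candidates` is a list of (question, response, correct, n_reasoning_tokens)
    tuples. For each question, keep the response with the highest
    VoC-style reward (correctness minus reasoning-token cost), so
    correct-but-shorter traces are preferred for the fine-tuning set.
    """
    best = {}  # question -> (response, reward)
    for question, response, correct, n_tokens in candidates:
        reward = (1.0 if correct else 0.0) - token_cost * n_tokens
        if question not in best or reward > best[question][1]:
            best[question] = (response, reward)
    return {question: response for question, (response, _) in best.items()}
```

Training on data selected this way teaches the model to skip intermediate reasoning on questions it can answer directly, while retaining it where reasoning changes the outcome.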