🤖 AI Summary
Small language models (SLMs) often consume excessive tokens on redundant reasoning because their halting points during inference are poorly controlled. This work identifies halting control as a critical determinant of SLM inference efficiency. We propose Temperature Scaling (TS), a mechanism for dynamically regulating reasoning length, and further introduce TLDR, a multi-level length-regularized reinforcement learning framework that integrates TS, GRPO, chain-of-thought intervention, and supervised fine-tuning. Evaluated on four reasoning benchmarks (MATH500, AMC, AIME24, and OlympiadBench), our method achieves roughly 50% token savings with negligible accuracy degradation (<0.5%) while enabling fine-grained, controllable response-length generation. Our core contributions are: (1) establishing a new paradigm for halting control in SLMs, and (2) proposing the first multi-level RL training methodology explicitly optimized for length efficiency.
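The summary doesn't specify how TS intervenes at decoding time. One plausible realization, sketched below under stated assumptions, is to upweight the logit of the end-of-thinking delimiter (assumed here to be a `</think>` token) so the model halts sooner; the `stop_token_id` lookup, the `alpha` value, and this exact parameterization are illustrative, not the paper's confirmed formulation.

```python
import math
import torch

def boost_stop_token(logits: torch.Tensor,
                     stop_token_id: int,
                     alpha: float) -> torch.Tensor:
    """Upweight the end-of-thinking token before sampling.

    Adding log(alpha) to a logit multiplies that token's unnormalized
    probability by alpha: alpha > 1 encourages earlier halting,
    alpha < 1 delays it. (Assumed mechanism, for illustration only.)
    """
    adjusted = logits.clone()
    adjusted[..., stop_token_id] += math.log(alpha)
    return adjusted

# Hypothetical use inside a decoding loop:
# logits = model(input_ids).logits[:, -1, :]            # last-step logits
# stop_id = tokenizer.convert_tokens_to_ids("</think>")  # assumed delimiter
# next_id = torch.argmax(boost_stop_token(logits, stop_id, alpha=2.0), dim=-1)
```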
📝 Abstract
Recent research enhances language-model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models distilled with supervised fine-tuning (SFT). In this work, we propose new algorithms for token-efficient reasoning with small-scale models that effectively trade off accuracy and computation. We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs; verbosity also varies significantly between correct and incorrect responses. To address these issues, we propose two solutions: (1) temperature scaling (TS) to control the stopping point of the thinking phase, and thereby the trace length, and (2) TLDR, a length-regularized reinforcement learning method based on GRPO that facilitates multi-level trace-length control (e.g., short, medium, or long reasoning). Experiments on four reasoning benchmarks, MATH500, AMC, AIME24, and OlympiadBench, demonstrate that TS is highly effective compared to s1's budget-forcing approach and that TLDR significantly improves token efficiency by about 50% with minimal to no accuracy loss over the SFT baseline. Moreover, TLDR facilitates flexible control over the response length, offering a practical and effective solution for token-efficient reasoning in small models. Ultimately, our work reveals the importance of stopping-time control, highlights the shortcomings of pure SFT, and provides effective algorithmic recipes.
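The abstract doesn't give TLDR's objective in closed form. As a rough illustration of what a multi-level, length-regularized GRPO reward could look like, the sketch below pairs a correctness reward with a per-level token budget and normalizes rewards within a group of rollouts, as GRPO does; the budget values, penalty weight, and function names are assumptions, not the paper's actual objective.

```python
import statistics

# Illustrative per-level token budgets (hypothetical values).
BUDGETS = {"short": 512, "medium": 1024, "long": 2048}

def length_regularized_reward(is_correct: bool,
                              num_tokens: int,
                              level: str,
                              penalty_weight: float = 1e-3) -> float:
    """Correctness reward minus a penalty for exceeding the requested
    verbosity level's token budget."""
    task_reward = 1.0 if is_correct else 0.0
    overflow = max(0, num_tokens - BUDGETS[level])
    return task_reward - penalty_weight * overflow

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: rewards normalized within the group of
    rollouts sampled for the same prompt."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Example: four rollouts at the "medium" level (values invented).
rewards = [length_regularized_reward(c, n, "medium")
           for c, n in [(True, 900), (True, 1600), (False, 700), (True, 1100)]]
advantages = group_advantages(rewards)
```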