🤖 AI Summary
This work addresses the tendency of large language models (LLMs) trained with reinforcement learning (RL) to generate overly verbose responses, which inflates inference latency and computational cost. Existing approaches rely on fixed heuristic rewards that struggle to balance task performance with response conciseness. To overcome this limitation, the authors propose LACONIC, a method that imposes a target token budget during RL training and dynamically balances task reward against response length through an adaptively adjusted length penalty, without altering the inference pipeline. LACONIC is compatible with standard RL fine-tuning and comes with theoretical guarantees. Experiments show that on mathematical reasoning tasks it reduces output length by over 50% while maintaining or improving pass@1 accuracy; on general knowledge and multilingual benchmarks it achieves comparable out-of-domain performance using 44% fewer tokens.
📝 Abstract
Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can induce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and requires brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide theoretical guarantees that support the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL fine-tuning with no inference changes and minimal deployment overhead.
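The abstract's "task reward plus adaptively scaled length cost" resembles a dual-variable update in constrained RL. A minimal sketch of that idea follows; the function names, the normalized-overflow cost, and the dual-ascent update rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: a token-budget length penalty whose scale (lam)
# adapts during training, in the spirit of dual ascent on a constraint.
# All names and the specific update rule are illustrative assumptions.

def shaped_reward(task_reward: float, length: int, budget: int, lam: float) -> float:
    """Task reward minus a length cost scaled by the adaptive coefficient lam."""
    overflow = max(0.0, length - budget) / budget  # normalized budget violation
    return task_reward - lam * overflow

def update_lambda(lam: float, batch_lengths: list[int], budget: int, lr: float = 0.05) -> float:
    """Raise lam when average response length exceeds the budget;
    relax it toward zero once responses fit within the budget."""
    avg_len = sum(batch_lengths) / len(batch_lengths)
    violation = (avg_len - budget) / budget
    return max(0.0, lam + lr * violation)

# Toy loop: lam grows while generations overshoot a 512-token budget,
# increasing the pressure toward shorter responses.
lam = 0.0
for lengths in [[900, 800], [700, 600], [500, 400]]:
    lam = update_lambda(lam, lengths, budget=512)
```

Because only the training-time reward is shaped, the inference pipeline is untouched, consistent with the deployment claim in the abstract.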