🤖 AI Summary
Large language models (LLMs) often generate excessively verbose responses during inference, incurring high computational costs and failing to leverage historical interaction data effectively.
Method: This paper proposes a history-aware response length optimization framework. It maintains a per-problem historical state—e.g., the length of the shortest previously correct response—and introduces a dual-objective reward: a correctness reward combined with an adaptive length reward that progressively compresses reasoning chains. Policy optimization is performed via Proximal Policy Optimization (PPO); crucially, the reward design enables stable fine-tuning because it does not over-penalize responses that are short but incorrect.
Contribution/Results: The reward design encourages exploration of shorter solutions while keeping optimization stable. Evaluated on mathematical reasoning benchmarks spanning multiple difficulty levels, the method reduces response length by 33–59% with accuracy drops of only 2–5%, substantially improving inference efficiency and solution conciseness.
📝 Abstract
While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experimental results demonstrate that HAPO effectively induces LLMs' concise reasoning abilities, producing length reductions of 33–59% with accuracy drops of only 2–5%.