🤖 AI Summary
This work addresses the high computational cost incurred by large reasoning models when generating redundant chains of thought in complex tasks. Existing compression approaches often compromise logical coherence or entail substantial sampling overhead. To overcome these limitations, the authors propose a reinforcement learning-based framework for compressing reasoning trajectories, formulating compression as a reward optimization problem driven by a weighted combination of answer correctness and reasoning confidence. A frozen auxiliary large reasoning model is introduced to jointly preserve prediction accuracy and reasoning validity. Evaluated across five reasoning benchmarks, the method reduces average reasoning length by 43% with only a 0.7% drop in accuracy, achieving a significantly improved trade-off between efficiency and performance.
📄 Abstract
Recent breakthroughs in Large Reasoning Models (LRMs) have demonstrated that extensive Chain-of-Thought (CoT) generation is critical for enabling intricate cognitive behaviors, such as self-verification and backtracking, to solve complex tasks. However, this capability often leads to "overthinking", where models generate redundant reasoning paths that inflate computational costs without improving accuracy. While Supervised Fine-Tuning (SFT) on reasoning traces is a standard paradigm for the "cold start" phase, applying existing compression techniques to these traces often compromises logical coherence or incurs prohibitive sampling costs. In this paper, we introduce ConMax (Confidence-Maximizing Compression), a novel reinforcement learning framework designed to automatically compress reasoning traces while preserving essential reasoning patterns. ConMax formulates compression as a reward-driven optimization problem, training a policy to prune redundancy by maximizing a weighted combination of answer confidence for predictive fidelity and thinking confidence for reasoning validity through a frozen auxiliary LRM. Extensive experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off. Specifically, it reduces inference length by 43% over strong baselines at the cost of a mere 0.7% dip in accuracy, proving its effectiveness in generating high-quality, efficient training data for LRMs.
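The abstract describes the ConMax reward as a weighted combination of two confidence signals, both scored by a frozen auxiliary LRM. A minimal sketch of that formulation, assuming a simple linear weighting; the function name, weights, and scoring stubs are illustrative, not taken from the paper:

```python
# Hypothetical sketch of the reward described in the abstract:
# reward = w_a * answer_confidence + w_t * thinking_confidence,
# where both confidences would come from a frozen auxiliary LRM.
# All names and the 50/50 default weights are assumptions.

def conmax_reward(answer_conf: float, thinking_conf: float,
                  w_answer: float = 0.5, w_thinking: float = 0.5) -> float:
    """Combine predictive fidelity and reasoning validity into one reward."""
    return w_answer * answer_conf + w_thinking * thinking_conf

# Example: a compressed trace whose final answer the auxiliary model
# scores at 0.9 and whose reasoning it scores at 0.8.
reward = conmax_reward(0.9, 0.8)
print(reward)  # ~0.85 with equal weights
```

In an actual RL setup, this scalar would drive the policy update (e.g. PPO-style), trading off trace length against the two confidence terms; the abstract does not specify the exact weighting or optimizer.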