DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models (LRMs) suffer from cognitive inefficiency: they over-reason on simple problems and under-reason on complex ones. Existing approaches—relying on supervised fine-tuning (SFT) or fixed-length reward optimization—often degrade accuracy while enforcing uniform reasoning lengths. This work proposes a dual-reward dynamic reasoning control mechanism. First, it introduces an adaptive-length reward grounded in problem difficulty classification, enabling real-time compression of reasoning chains for simple queries and deepening exploration for complex ones. Second, it synergistically integrates SFT with reinforcement learning to jointly optimize chain-of-thought generation. Evaluated on mathematical reasoning benchmarks, our method significantly outperforms baselines, improving accuracy while reducing average token consumption by 32.7%. It is the first to achieve autonomous alignment between reasoning depth and problem difficulty—simultaneously enhancing both efficiency and reliability.

📝 Abstract
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real-time based on the model's evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
Problem

Research questions and friction points this paper is trying to address.

Improves accuracy and efficiency of Large Reasoning Models
Addresses cognitive inefficiencies like overthinking and underthinking
Dynamically adjusts reasoning chain length based on problem difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive length reward mechanism for problem classification
Dual-reward strategy adjusting reasoning chain length dynamically
Compresses reasoning for simple problems, extends for hard ones
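The dual-reward idea above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the solve-rate threshold, and the `alpha` weight are all hypothetical; the paper classifies difficulty from the model's evolving capability, which is approximated here by a recent solve rate.

```python
def classify_difficulty(solve_rate, threshold=0.5):
    """Label a problem "Simple" or "Hard" from the model's recent solve
    rate on it (threshold is a hypothetical choice, not from the paper)."""
    return "Simple" if solve_rate >= threshold else "Hard"

def adaptive_length_reward(is_correct, length, mean_length, difficulty, alpha=0.1):
    """Dual reward: a correctness term plus a signed length term.

    For "Simple" problems, responses shorter than the batch mean earn a
    bonus (compression); for "Hard" problems, longer responses earn it
    (exploration). alpha is an illustrative weight.
    """
    correctness = 1.0 if is_correct else 0.0
    # Normalized deviation from the mean response length, clipped to [-1, 1].
    deviation = max(-1.0, min(1.0, (length - mean_length) / max(mean_length, 1)))
    if difficulty == "Simple":
        length_bonus = -alpha * deviation   # reward brevity
    else:
        length_bonus = alpha * deviation    # reward longer exploration
    return correctness + length_bonus
```

Under this sketch, a correct 50-token answer to a "Simple" problem (batch mean 100) scores 1.05, while the same 50-token answer to a "Hard" problem scores 0.95, nudging the model toward longer chains only where it still struggles.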