ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning

📅 2025-05-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models (LRMs) frequently generate redundant reasoning steps in chain-of-thought (CoT) inference, leading to high computational overhead and a degraded user experience; existing compression methods, largely post-hoc or sampling-based, struggle to preserve reasoning coherence. This work identifies two root causes of redundancy: insufficient step-wise confidence and delayed termination. We propose the first confidence-guided real-time compression framework, which suppresses both forms of redundant reflection at the generation source. Our approach integrates intermediate-layer confidence modeling and injection, dynamic early stopping, and SimPO-based alignment fine-tuning. Experiments across multiple reasoning benchmarks show that our method reduces average CoT length by roughly 50% while maintaining near-original accuracy, significantly outperforming prior compression techniques.
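
As a rough illustration of the confidence signal and dynamic early stopping described above, here is a minimal sketch that approximates a step's confidence by the geometric-mean probability of its tokens; the threshold, patience window, and aggregation are illustrative assumptions, not the paper's exact formulation (ConCISE models confidence from intermediate layers).

```python
import math

# Hedged sketch: approximate a reasoning step's confidence by the
# geometric-mean probability of its tokens. Threshold and patience
# window are illustrative assumptions.

def step_confidence(token_logprobs):
    """Geometric-mean token probability of one reasoning step."""
    return math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))

def should_stop(confidences, threshold=0.9, patience=2):
    """Dynamic early stopping: halt once the last `patience` steps all
    exceed `threshold`, i.e. the model is already confident."""
    return len(confidences) >= patience and all(
        c >= threshold for c in confidences[-patience:]
    )

# Toy trace: per-token log-probs for four reasoning steps; confidence
# rises as the chain converges, and generation stops once two
# consecutive steps clear the threshold.
chain = [[-0.9, -0.7], [-0.3, -0.2], [-0.05, -0.02], [-0.04, -0.01]]
confs = []
for i, step in enumerate(chain, 1):
    confs.append(step_confidence(step))
    print(f"step {i}: confidence={confs[-1]:.3f} stop={should_stop(confs)}")
```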

📝 Abstract
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, which increase computational overhead and degrade user experience. Existing compression methods either perform post-hoc pruning, risking disruption of reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.
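
For the SimPO fine-tuning stage the abstract mentions, ConCISE-compressed chains can serve as preferred responses and the original verbose chains as rejected ones. The sketch below implements the published SimPO objective (a length-normalized log-likelihood margin with target margin gamma); the beta and gamma values and the toy inputs are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of SimPO preference fine-tuning on ConCISE-style pairs:
# the concise chain is "chosen", the verbose original is "rejected".
# beta/gamma and the toy numbers are assumptions.

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: length-normalized implicit reward with margin gamma.
    loss = -log sigmoid(beta/|y_w| * logp(y_w)
                        - beta/|y_l| * logp(y_l) - gamma)
    """
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Toy batch: the short chain has higher per-token likelihood, so its
# length-normalized reward wins and the loss is pushed down.
loss = simpo_loss(
    logp_chosen=torch.tensor([-40.0]),    # sum log-prob, 50-token concise chain
    logp_rejected=torch.tensor([-200.0]), # sum log-prob, 150-token verbose chain
    len_chosen=torch.tensor([50.0]),
    len_rejected=torch.tensor([150.0]),
)
print(f"SimPO loss: {loss.item():.4f}")
```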
Problem

Research questions and friction points this paper is trying to address.

Redundant outputs in Large Reasoning Models increase computational overhead and degrade user experience.
Existing compression methods either disrupt reasoning coherence (post-hoc pruning) or fail to intervene during generation (sampling-based selection).
Open challenge: cutting CoT length substantially (ConCISE targets roughly 50%) without sacrificing accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confidence-guided compression prevents redundant reflection steps at the generation source.
Integrates Confidence Injection to stabilize intermediate steps (a minimal decode-time sketch follows this list).
Employs Early Stopping to terminate reasoning once confidence is sufficient.
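
To make the two mechanisms concrete, here is a minimal decode-time sketch. The `generate_step` and `estimate_confidence` callables, the injected phrase, and both thresholds are hypothetical stand-ins; the paper derives confidence from intermediate-layer signals rather than this exact string-level scheme.

```python
# Hedged sketch of Confidence Injection plus Early Stopping at decode
# time. CONFIDENCE_PHRASE, the thresholds, and the two callables are
# hypothetical, not the paper's exact mechanism.

CONFIDENCE_PHRASE = " I am confident this step is correct."  # assumed text

def compress_generation(prompt, generate_step, estimate_confidence,
                        max_steps=16, inject_below=0.7, stop_above=0.9):
    """Generate a reasoning chain step by step, injecting an affirmation
    after low-confidence steps and stopping early once confidence is high."""
    context = prompt
    for _ in range(max_steps):
        step = generate_step(context)           # next reasoning step (string)
        conf = estimate_confidence(context, step)
        context += step
        if conf >= stop_above:                  # Early Stopping
            break
        if conf < inject_below:                 # Confidence Injection
            context += CONFIDENCE_PHRASE
    return context
```

The same confidence proxy drives both interventions: low confidence triggers an injection that forestalls redundant re-checking, while sustained high confidence terminates the chain.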
Ziqing Qiao
Tsinghua University
Yongheng Deng
Tsinghua University
Jiali Zeng
Tencent
Natural Language Processing · Deep Learning · Neural Machine Translation
Dong Wang
Tsinghua University
Lai Wei
Tsinghua University
Fandong Meng
WeChat AI, Tencent
Machine Translation · Natural Language Processing
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc., China
Ju Ren
Department of Computer Science and Technology, Tsinghua University
Internet-of-Things · Edge Computing/Intelligence · Security and Privacy
Yaoxue Zhang
Tsinghua University