TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the efficiency bottleneck posed by lengthy output generation in large language models (LLMs) during complex reasoning, this paper proposes a dynamic ratio reweighting training paradigm grounded in dual-process cognitive theory. The method adaptively prunes reasoning chains by adjusting training weights online between System-1 (intuitive) and System-2 (deliberative) data, requiring no human annotations, auxiliary models, or ensemble techniques. Its core innovations are (i) a learnable dynamic weight-scheduling mechanism and (ii) a synergistic Chain-of-Thought distillation framework for joint optimization. Evaluated on the DeepSeek-R1-Distill model series, the approach reduces output token count by 39.7% on average while preserving inference accuracy, yielding substantial gains in throughput and deployment feasibility for long-reasoning tasks.
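The core idea of mixing System-1 (short, intuitive) and System-2 (long, deliberative CoT) training data under a dynamically adjusted ratio can be sketched as follows. This is a minimal illustration, not the paper's actual scheduler: the proportional update rule, the function names, and the parameters (`lr`, `target_len`) are assumptions for the sketch, whereas the paper describes a learnable scheduling mechanism.

```python
import random

def update_system1_weight(w, avg_len, target_len, lr=0.05):
    """Nudge the System-1 (short-answer) sampling weight up when the
    model's outputs are still longer than the target length, and down
    otherwise. Illustrative proportional rule, not the paper's learned
    scheduler."""
    error = (avg_len - target_len) / max(target_len, 1)
    w = w + lr * error
    return min(max(w, 0.0), 1.0)  # keep the mixing ratio in [0, 1]

def sample_batch(system1_data, system2_data, w, batch_size, rng):
    """Draw a training batch mixing System-1 (intuitive) and System-2
    (deliberative CoT) examples according to the current weight w."""
    n1 = round(w * batch_size)
    batch = rng.choices(system1_data, k=n1) + \
            rng.choices(system2_data, k=batch_size - n1)
    rng.shuffle(batch)
    return batch
```

In this toy loop, when the model's average output length exceeds the target, the System-1 share of each batch grows, pushing the model toward shorter reasoning; when outputs undershoot, the System-2 share grows to preserve deliberative capability.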

📝 Abstract
Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning, especially during inference with extremely long outputs, has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.
Problem

Research questions and friction points this paper is trying to address.

Efficient language reasoning with long outputs
Reducing redundant reasoning processes in LLMs
Maintaining accuracy while compressing output tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic ratio-based training pipeline
Balancing System-1 and System-2 weights
Reduces output tokens by nearly 40% (39.7% on average)
👥 Authors
Zhong-Zhi Li, Institute of Automation, Chinese Academy of Sciences
Xiao Liang, University of California, Los Angeles
Zihao Tang, Microsoft
Lei Ji, Microsoft
Peijie Wang, Institute of Automation, Chinese Academy of Sciences (Multimodal LLMs, math reasoning)
Haotian Xu, Tsinghua University
W. Xing, Institute of Automation, Chinese Academy of Sciences
Haizhen Huang, Microsoft
Weiwei Deng, Professor of Mechanical Engineering, Southern University of Science and Technology (electrospray, fluid dynamics, optofluidics)
Ying Nian Wu, UCLA Department of Statistics and Data Science (Generative AI, Representation learning, Computer vision, Computational neuroscience, Bioinformatics)
Yeyun Gong, Microsoft Research Asia (Natural Language Generation, Question Answering, Pre-training)
Zhijiang Guo, HKUST (GZ) | HKUST (Natural Language Processing, Machine Learning, Large Language Models)
Xiao Liu, Microsoft
Fei Yin, Institute of Automation, Chinese Academy of Sciences
Cheng-Lin Liu, Institute of Automation, Chinese Academy of Sciences