From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high latency induced by verbose chain-of-thought (CoT) reasoning, this paper proposes MACC, a Multi-round Adaptive Compression framework. Methodologically, MACC identifies and exploits "token elasticity"—a phenomenon wherein tokens at different positions in a CoT exhibit markedly heterogeneous robustness to compression—enabling adaptive, multi-stage progressive compression tailored to each input. It further introduces an interpretable performance prediction model grounded in perplexity and compression ratio, enabling accurate joint estimation of accuracy and latency across diverse large language models without fine-tuning, thus facilitating efficient model selection. Evaluated on multiple state-of-the-art LLMs, MACC achieves an average 5.6% accuracy gain, shortens reasoning chains by 47 tokens on average, and significantly reduces end-to-end latency. It is the first approach to unify dynamic adaptability, predictive reliability, and cross-model generalizability in CoT compression.
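The multi-round refinement loop can be illustrated with a toy sketch. The compressor, the shrink schedule, and the elasticity-based stopping rule below are placeholders chosen for illustration, not the paper's actual method:

```python
def compress_once(cot_tokens, budget):
    # Stand-in compressor: thin the chain to roughly meet the budget.
    # (A real system would use an LLM or learned compressor here.)
    if len(cot_tokens) <= budget:
        return cot_tokens
    step = max(1, len(cot_tokens) // budget)
    return cot_tokens[::step][:budget]

def multi_round_compress(cot_tokens, shrink=0.8, min_budget=8, max_rounds=5):
    """Progressively shrink the token budget round by round, stopping
    when further compression no longer shortens the chain -- a crude
    stand-in for the 'token elasticity' signal, where overly small
    budgets can paradoxically increase output length."""
    best = list(cot_tokens)
    budget = int(len(cot_tokens) * shrink)
    for _ in range(max_rounds):
        candidate = compress_once(best, budget)
        if len(candidate) >= len(best):
            break  # compression stopped helping; keep the previous chain
        best = candidate
        budget = max(min_budget, int(budget * shrink))
    return best
```

In this sketch the compression depth emerges per input: short chains exit early, while long chains are refined over several rounds.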

📝 Abstract
Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multi-round Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon--where overly small token budgets can paradoxically increase output length--to progressively compress CoTs via multi-round refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance--accuracy and token length--can be reliably predicted using interpretable features such as perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released at https://github.com/Leon221220/MACC.
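The abstract's claim that test-time accuracy can be predicted from interpretable features suggests a simple regression over perplexity and compression rate. The pure-Python sketch below (with a hypothetical `fit_linear` and synthetic data, not the paper's actual predictor) shows the shape of such a model:

```python
def fit_linear(features, targets, lr=0.1, epochs=20000):
    """Fit y ≈ w·x + b by plain gradient descent.
    Each feature vector x might be (perplexity, compression_rate),
    with y the observed accuracy on a held-out training split."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * gw / n for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

Once fitted on one model's training runs, such a predictor can score candidate models or compression settings without repeated fine-tuning, which is the kind of forecasting the abstract describes.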
Problem

Research questions and friction points this paper is trying to address.

Reducing Chain-of-Thought reasoning latency while maintaining performance
Adaptively compressing verbose reasoning chains via multi-round refinement
Predicting compressed reasoning accuracy and length without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressively compresses reasoning chains via multi-round refinement
Adaptively determines optimal compression depth per input
Predicts performance using interpretable perplexity and compression features
Jianzhi Yan
Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China
Le Liu
Northwestern Polytechnical University
Youcheng Pan
Pengcheng Laboratory, Shenzhen, China
Shiwei Chen
Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China
Zike Yuan
Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China
Yang Xiang
Pengcheng Laboratory, Shenzhen, China; Shaoguan Research Institute of Data Industry, China
Buzhou Tang
Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China