Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning

📅 2026-01-29
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of large reasoning models caused by redundant reasoning steps that lead to excessively long responses, hindering practical deployment. To tackle this issue, the authors propose SCMA, a multi-agent reinforcement learning framework featuring two collaborative agents—“Segmentation” and “Scoring”—that dynamically identify and preserve essential reasoning logic while compressing superfluous content. The method incorporates an importance-weighted length penalty mechanism, enabling self-compression of reasoning chains without compromising accuracy. Experimental results demonstrate that SCMA consistently reduces response length by 11.1%–39.0% across various model scales while simultaneously improving accuracy by 4.33%–10.02%, thereby achieving significant gains in both reasoning efficiency and overall performance.

📝 Abstract
The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
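The abstract's central mechanism, an importance-weighted length penalty over reasoning chunks, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the `Chunk` dataclass, the penalty coefficient `alpha`, and the assumption that the Scoring Agent emits importance scores in [0, 1] are all hypothetical choices made for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One logical chunk of the reasoning trace (hypothetical representation)."""
    n_tokens: int      # length of the chunk, produced by the Segmentation Agent
    importance: float  # significance score in [0, 1], produced by the Scoring Agent

def importance_weighted_length_penalty(chunks: list[Chunk], alpha: float = 0.01) -> float:
    """Penalize tokens in proportion to how unimportant their chunk is.

    Chunks the Scoring Agent deems essential (importance near 1) contribute
    almost no penalty; low-importance chunks pay the full per-token cost,
    so the Reasoning Agent is pushed to compress only superfluous content.
    """
    return alpha * sum(c.n_tokens * (1.0 - c.importance) for c in chunks)

def reward(outcome_correct: bool, chunks: list[Chunk], alpha: float = 0.01) -> float:
    """Outcome-based reward minus the selective length penalty."""
    outcome = 1.0 if outcome_correct else 0.0
    return outcome - importance_weighted_length_penalty(chunks, alpha)
```

Under this sketch, a correct answer whose trace contains a 10-token essential chunk (importance 1.0) and a 20-token redundant chunk (importance 0.0) is penalized only for the redundant tokens, in contrast to a flat length penalty that would also tax the essential logic.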
Problem

Research questions and friction points this paper is trying to address.

reasoning overhead
redundant reasoning
length-accuracy trade-off
Large Reasoning Models
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Reinforcement Learning
Chain-of-Thought Compression
Redundancy Detection
Reasoning Efficiency
Large Reasoning Models