Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the trade-off between performance and token cost when large language models are post-trained on compressed reasoning data, whose effect remains poorly understood. The study proposes a taxonomy of chain-of-thought (CoT) reasoning with three types: Explicit, Composed, and Implicit. Using a synthetic compositional reasoning task, the authors systematically vary difficulty, compression granularity, and data size under supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), across multiple model families and sizes. Key findings: coarser CoT requires more SFT data; Composed and Implicit CoT benefit more from data scaling than Explicit CoT; and Composed CoT benefits from data repetition, whereas Implicit CoT tends toward memorization. Moreover, RLVR decomposes the compressed steps learned during SFT, and unidirectional CoT ordering generalizes better on longer sequential tasks.
📝 Abstract
Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation; Composed CoT, which combines multiple operations into a single step; and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data; (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization; (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT; and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.
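To make the three CoT types in the abstract concrete, here is a minimal sketch in Python. It is not the paper's task: the toy operation chain, the trace format, and the names `OPS`, `explicit_cot`, `composed_cot`, `implicit_cot`, and the `group` parameter are all assumptions chosen for illustration. It only shows how one underlying computation might be rendered at three compression levels.

```python
from typing import Callable, List, Tuple

# Hypothetical toy task (an assumption, not the paper's benchmark):
# apply a fixed chain of simple integer operations to a start value.
Op = Tuple[str, Callable[[int], int]]

OPS: List[Op] = [
    ("+3", lambda x: x + 3),
    ("*2", lambda x: x * 2),
    ("-1", lambda x: x - 1),
    ("*3", lambda x: x * 3),
]


def explicit_cot(start: int, ops: List[Op]) -> List[str]:
    """Explicit CoT: one trace line per operation, no aggregation."""
    steps, x = [], start
    for name, fn in ops:
        y = fn(x)
        steps.append(f"{x} {name} = {y}")
        x = y
    return steps + [f"answer: {x}"]


def composed_cot(start: int, ops: List[Op], group: int = 2) -> List[str]:
    """Composed CoT: merge `group` consecutive operations into one step."""
    steps, x = [], start
    for i in range(0, len(ops), group):
        chunk = ops[i:i + group]
        y = x
        for _, fn in chunk:
            y = fn(y)
        names = " ".join(name for name, _ in chunk)
        steps.append(f"{x} {names} = {y}")
        x = y
    return steps + [f"answer: {x}"]


def implicit_cot(start: int, ops: List[Op]) -> List[str]:
    """Implicit CoT: omit all intermediate operations, emit only the answer."""
    x = start
    for _, fn in ops:
        x = fn(x)
    return [f"answer: {x}"]


if __name__ == "__main__":
    for render in (explicit_cot, composed_cot, implicit_cot):
        trace = render(5, OPS)
        print(f"{render.__name__}: {len(trace)} lines")
        for line in trace:
            print("  " + line)
```

In this sketch, a larger `group` corresponds to coarser compression: the trace gets shorter, but each step hides more intermediate work. That is the granularity axis the abstract describes varying.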
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Compressed Reasoning
Supervised Fine-Tuning
Post-Training
Token Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed Reasoning
Chain-of-Thought
Supervised Fine-Tuning
Reinforcement Learning with Verifiable Rewards
Compositional Reasoning