Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning-oriented language models (e.g., DeepSeek-R1) incur high deployment costs due to lengthy chain-of-thought (CoT) reasoning, while conventional pruning methods suffer from substantial performance degradation in decoding-dominated tasks. To address this, we propose *reasoning-aware compression*, the first framework that incorporates the model’s own CoT trajectories into pruning—jointly reconstructing both input representations and critical CoT neuron activations to precisely preserve reasoning pathways. Built upon SparseGPT, our method integrates CoT activation-guided sparsification and reconstruction, enhancing compression robustness without fine-tuning. Experiments across multiple mathematical and reasoning benchmarks demonstrate that, at 40%–60% parameter compression rates, our approach achieves an average accuracy gain of 8.2% and reduces inference latency by 37%, significantly outperforming existing pruning methods. This work establishes a new paradigm for efficient deployment of reasoning-intensive LMs.

Technology Category

Application Category

📝 Abstract
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
Problem

Research questions and friction points this paper is trying to address.

Pruning reasoning models causes performance loss and inefficiency
Standard pruning methods focus on input not output reconstruction
Proposes joint input and chain-of-thought reconstruction during pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs input and chain-of-thought activations
Integrates seamlessly into existing pruning workflows
Uses reasoning-aware compression to boost performance
🔎 Similar Papers
No similar papers found.