Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

📅 2025-09-15

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Large reasoning-oriented language models (e.g., DeepSeek-R1) incur high deployment costs due to lengthy chain-of-thought (CoT) reasoning, while conventional pruning methods suffer from substantial performance degradation in decoding-dominated tasks. To address this, we propose *reasoning-aware compression*, the first framework that incorporates the model’s own CoT trajectories into pruning—jointly reconstructing both input representations and critical CoT neuron activations to precisely preserve reasoning pathways. Built upon SparseGPT, our method integrates CoT activation-guided sparsification and reconstruction, enhancing compression robustness without fine-tuning. Experiments across multiple mathematical and reasoning benchmarks demonstrate that, at 40%–60% parameter compression rates, our approach achieves an average accuracy gain of 8.2% and reduces inference latency by 37%, significantly outperforming existing pruning methods. This work establishes a new paradigm for efficient deployment of reasoning-intensive LMs.

Technology Category

Application Category

📝 Abstract

Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC

Problem

Research questions and friction points this paper is trying to address.

Pruning reasoning models causes performance loss and inefficiency

Standard pruning methods focus on input not output reconstruction

Proposes joint input and chain-of-thought reconstruction during pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs input and chain-of-thought activations

Integrates seamlessly into existing pruning workflows

Uses reasoning-aware compression to boost performance

🔎 Similar Papers

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting