🤖 AI Summary
Large reasoning-oriented language models (e.g., DeepSeek-R1) incur high deployment costs due to lengthy chain-of-thought (CoT) reasoning, while conventional pruning methods suffer from substantial performance degradation in decoding-dominated tasks. To address this, we propose *reasoning-aware compression*, the first framework that incorporates the model’s own CoT trajectories into pruning—jointly reconstructing both input representations and critical CoT neuron activations to precisely preserve reasoning pathways. Built upon SparseGPT, our method integrates CoT activation-guided sparsification and reconstruction, enhancing compression robustness without fine-tuning. Experiments across multiple mathematical and reasoning benchmarks demonstrate that, at 40%–60% parameter compression rates, our approach achieves an average accuracy gain of 8.2% and reduces inference latency by 37%, significantly outperforming existing pruning methods. This work establishes a new paradigm for efficient deployment of reasoning-intensive LMs.
📝 Abstract
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC