🤖 AI Summary
Existing pruning methods are fragile on reasoning-oriented large language models (RLMs): even moderate sparsity (e.g., 20%) causes substantial degradation in both reasoning accuracy and output coherence. To address this, we propose RESP, a self-reflective structured pruning framework that leverages the model's own self-generated reasoning traces as calibration signals. The method combines decode-only gradient-based importance estimation with progressive feature regeneration, aligning pruning decisions with the model's multi-step autoregressive decoding behavior. The core innovation is a self-reflection mechanism that explicitly models consistency between pruning decisions and reasoning dynamics. Experiments on Qwen3-8B show that RESP preserves near-full-parameter accuracy at 20–30% sparsity; at 40% sparsity, it achieves 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baseline by 66.87 and 47.0 percentage points, respectively.
📝 Abstract
Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce compute and memory costs, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs: even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87 and 47.0 percentage points, respectively.
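To make the "gradient-based importance estimation" step concrete, here is a minimal, assumption-laden sketch of the standard first-order Taylor importance criterion for structured pruning: score each output channel (row of a weight matrix) by the summed magnitude of weight × gradient, then drop the lowest-scoring rows. The paper's actual estimator runs on decoder layers with gradients taken over decode-only tokens of self-generated traces; this toy uses a single plain-Python weight matrix purely for illustration, and the function names are hypothetical.

```python
# Toy illustration of gradient-based structured importance scoring.
# Assumption: importance of row i is sum_j |w_ij * g_ij|, a common
# first-order Taylor proxy for the loss change if the row is removed.
# In RESP, grads would come from backprop on self-generated reasoning
# traces; here they are hard-coded for a runnable example.

def channel_importance(weights, grads):
    """Per-row importance score: sum_j |w_ij * g_ij|."""
    return [sum(abs(w * g) for w, g in zip(w_row, g_row))
            for w_row, g_row in zip(weights, grads)]

def prune_rows(weights, grads, sparsity):
    """Drop the lowest-importance fraction `sparsity` of rows.

    Returns the kept rows and their original indices (sorted), so a
    downstream layer's input dimension can be sliced to match.
    """
    scores = channel_importance(weights, grads)
    n_keep = len(weights) - int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: -scores[i])
    keep = sorted(order[:n_keep])
    return [weights[i] for i in keep], keep

# Small-magnitude rows with small gradients score lowest and are removed.
weights = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.01]]
grads   = [[0.2,  0.1], [0.30, 0.40], [0.1, 0.2], [ 0.50, 0.60]]
pruned, kept = prune_rows(weights, grads, sparsity=0.5)  # keeps rows 0 and 2
```

The weight-times-gradient product matters because a large weight with near-zero gradient (or vice versa) contributes little to the loss; pure magnitude pruning would miss this, which is one reason calibration data that matches the decode-time distribution is important.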