Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
Prior work indicates that structured pruning substantially degrades the test-time scaling (TTS) performance of large language models, but it remains unclear whether models can be compressed while preserving or even improving TTS capability. This work systematically investigates the impact of unstructured pruning on TTS, with experiments on the s1.1-7B and Qwen3-8B models across four reasoning benchmarks and several layer-wise sparsity allocation strategies. The results show that unstructured pruning consistently yields better TTS performance than structured pruning and, at times, even outperforms the original unpruned models. These findings challenge the prevailing assumption that pruning inevitably harms TTS and point to a promising direction for model compression that remains compatible with strong reasoning at inference time.
📝 Abstract
While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that reduce model size without sacrificing performance. However, for reasoning LLMs specifically, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and investigate whether unstructured pruning (methods that selectively remove only certain redundant or detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning improves TTS performance relative to structured pruning and at times can even outperform the unpruned full-weight LLMs. Furthermore, we empirically study the impact of different layer-wise sparsity allocation strategies, which are an important design choice when instantiating unstructured pruning methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can improve TTS effectiveness even further.
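To make the two ideas in the abstract concrete, here is a minimal sketch of unstructured pruning combined with a layer-wise sparsity allocation. It is an illustration under stated assumptions, not the method evaluated in the paper: the pruning criterion is plain magnitude pruning (the abstract does not name its criterion), and the function names `magnitude_prune`, `linear_sparsity_schedule`, `prune_model`, and `achieved_sparsity`, as well as the linear schedule itself, are hypothetical choices made only for this example.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Zero the `sparsity` fraction of entries with the smallest |w|.

    Assumption: magnitude is used as the importance score. This is a
    stand-in criterion, not necessarily the one used in the paper.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    # Rank every individual weight by absolute value (unstructured:
    # any single entry may be removed, regardless of row or column).
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    n_prune = int(len(ranked) * sparsity)
    pruned = [row[:] for row in weights]  # copy; leave the input untouched
    for _, r, c in ranked[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


def linear_sparsity_schedule(n_layers: int, target: float, delta: float) -> List[float]:
    """Hypothetical allocation: per-layer sparsity rises linearly from
    target - delta to target + delta, so the mean equals `target`."""
    if n_layers == 1:
        return [target]
    return [
        target - delta + 2.0 * delta * i / (n_layers - 1)
        for i in range(n_layers)
    ]


def prune_model(layers: List[Matrix], sparsities: List[float]) -> List[Matrix]:
    """Apply a per-layer sparsity level to each layer's weight matrix."""
    return [magnitude_prune(w, s) for w, s in zip(layers, sparsities)]


def achieved_sparsity(weights: Matrix) -> float:
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


if __name__ == "__main__":
    layer = [[0.9, -0.1, 0.4, -0.7], [0.05, 0.6, -0.3, 0.8]]
    schedule = linear_sparsity_schedule(n_layers=2, target=0.5, delta=0.25)
    pruned_layers = prune_model([layer, layer], schedule)
    for s, w in zip(schedule, pruned_layers):
        print(s, achieved_sparsity(w), w)
```

A uniform allocation is the special case `delta=0.0`. Structured pruning, by contrast, would drop whole layer blocks rather than individual entries, which is the distinction the abstract draws.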
Problem

Research questions and friction points this paper is trying to address.

LLM pruning
test-time scaling
reasoning performance
unstructured pruning
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

unstructured pruning
test-time scaling
reasoning LLMs
sparsity allocation
model compression