POP: Prefill-Only Pruning for Efficient Large Model Inference

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the accuracy degradation commonly observed in existing structured pruning methods, which overlook the asymmetry between the prefill and decode phases of large language model inference. The study reveals, for the first time, that the two phases depend differently on model depth, and proposes a phase-aware Prefill-Only Pruning (POP) framework. POP prunes only the deeper layers during the compute-intensive prefill phase while preserving the full architecture for the accuracy-sensitive decode phase. To ensure KV cache consistency and high-quality first-token generation, the method incorporates a virtual gating mechanism, separate Key-Value projections, and tailored boundary handling. Evaluated on models including Llama-3.1, Qwen3-VL, and Gemma-3, POP achieves up to 1.37× prefill acceleration with minimal performance loss, effectively overcoming the traditional accuracy-efficiency trade-off of structured pruning.

📝 Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
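The stage-aware idea in the abstract can be sketched in code: run only the shallow layers during prefill, but still populate the KV caches of the skipped deep layers via independent KV projections, so that decode can use the full model against a complete cache. The toy model below is a minimal illustration under assumed design choices (all class and attribute names are invented here, attention is single-head with no causal mask, and the paper's virtual gating and boundary handling are omitted); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """One simplified transformer block with a KV cache (no mask, single head)."""
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.ff = nn.Linear(d, d)

    def forward(self, x, cache):
        k, v = self.k(x), self.v(x)
        # append this call's keys/values to the layer's cache
        cache["k"] = torch.cat([cache["k"], k], dim=1) if "k" in cache else k
        cache["v"] = torch.cat([cache["v"], v], dim=1) if "v" in cache else v
        att = torch.softmax(
            self.q(x) @ cache["k"].transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1
        )
        return x + self.ff(att @ cache["v"])

class StageAwareModel(nn.Module):
    """Prefill runs only the first n_prefill layers; decode runs all layers."""
    def __init__(self, d=16, n_layers=6, n_prefill=4):
        super().__init__()
        self.layers = nn.ModuleList(ToyLayer(d) for _ in range(n_layers))
        # independent KV projections for the skipped deep layers: they fill
        # those layers' caches from the last retained hidden state (assumption)
        self.deep_kv = nn.ModuleList(
            nn.ModuleDict({"k": nn.Linear(d, d), "v": nn.Linear(d, d)})
            for _ in range(n_layers - n_prefill)
        )
        self.n_prefill = n_prefill

    def prefill(self, x):
        caches = [{} for _ in self.layers]
        for i in range(self.n_prefill):          # deep layers are skipped here
            x = self.layers[i](x, caches[i])
        for j, proj in enumerate(self.deep_kv):  # keep deep caches consistent
            caches[self.n_prefill + j]["k"] = proj["k"](x)
            caches[self.n_prefill + j]["v"] = proj["v"](x)
        return x, caches

    def decode(self, x, caches):
        for layer, cache in zip(self.layers, caches):  # full depth at decode
            x = layer(x, cache)
        return x
```

The point of the sketch is the cache handoff: because the deep layers' KV entries are written during prefill (from projections rather than full layer outputs), the decode stage can attend over the whole prompt at every depth without re-running prefill.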
Problem

Research questions and friction points this paper is trying to address.

structured pruning
large language models
inference efficiency
accuracy degradation
prefill-decode asymmetry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefill-Only Pruning
stage-aware pruning
KV cache integrity
virtual gate mechanism
structured pruning