POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models

📅 2026-02-06
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses a limitation of existing structured pruning methods: they rely on static pruning strategies and fail to adapt to the dynamically varying sparsity patterns inherent in autoregressive generation, compromising both inference efficiency and accuracy. To overcome this, the authors propose POP, a lightweight online structured pruning framework that operates without preprocessing, retraining, or auxiliary predictors. POP constructs a coarse-grained pruning partition during the prefill phase and applies fine-grained masks to candidate regions during decoding, yielding a context-aware dynamic sparsity mechanism. By partitioning channels into retained, candidate, and pruned regions, POP balances model accuracy against computational cost. Experiments show that POP consistently outperforms state-of-the-art pruning approaches across diverse large models, including LLMs, MoEs, and VLMs, achieving superior accuracy with lower computational overhead and reduced latency.
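
As a rough illustration of the partition-then-mask idea described above, the following Python sketch shows how a coarse three-way channel partition could be derived from prefill activations. This is not the paper's implementation: the function name, the mean-absolute-activation importance score, and the region fractions are all assumptions made for exposition.

```python
import torch

def coarse_partition(prefill_acts: torch.Tensor,
                     retain_frac: float = 0.5,
                     prune_frac: float = 0.2):
    """Split channels into retained / candidate / pruned index sets.

    prefill_acts: (num_tokens, num_channels) activations collected
    during the prefill phase. Importance is scored here as the mean
    absolute activation per channel (an assumption for illustration;
    the paper's actual scoring criterion may differ).
    """
    importance = prefill_acts.abs().mean(dim=0)          # (num_channels,)
    order = torch.argsort(importance, descending=True)   # high to low
    n = importance.numel()
    n_retain = int(retain_frac * n)
    n_prune = int(prune_frac * n)
    retained = order[:n_retain]               # consistently important: always kept
    pruned = order[n - n_prune:]              # consistently unimportant: always dropped
    candidate = order[n_retain:n - n_prune]   # re-scored per token during decoding
    return retained, candidate, pruned
```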

📝 Abstract
Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions before inference, overlooking sparsity patterns that emerge during autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing such as offline calibration, retraining, or learned predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring lower computational overhead and lower inference latency.
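
To complement the prefill-time sketch above, here is a hypothetical decode-step counterpart showing how a fine-grained mask might be drawn from the candidate region for each generated token, avoiding re-evaluation of the full channel set. Again, the per-token absolute-activation score and the keep fraction are illustrative assumptions, not the paper's criterion.

```python
import torch

def decode_step_mask(hidden: torch.Tensor,
                     retained: torch.Tensor,
                     candidate: torch.Tensor,
                     keep_frac: float = 0.5) -> torch.Tensor:
    """Build a boolean channel mask for one decoding step.

    hidden: (num_channels,) hidden state of the current token.
    Retained channels always pass; only the top-scoring fraction of
    the candidate region is switched on for this token; pruned
    channels (everything outside retained + candidate) stay off.
    """
    mask = torch.zeros_like(hidden, dtype=torch.bool)
    mask[retained] = True                           # coarse partition: fixed
    scores = hidden[candidate].abs()                # context-conditioned score
    k = max(1, int(keep_frac * candidate.numel()))
    top = torch.topk(scores, k).indices
    mask[candidate[top]] = True                     # fine-grained, per-token
    return mask
```

In this sketch, only the candidate region is re-scored at each step, so the per-token overhead scales with the candidate size rather than the full channel count, which is the efficiency argument the abstract makes for avoiding full-channel re-evaluation.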
Problem

Research questions and friction points this paper is trying to address.

structural pruning
autoregressive generation
sparsity patterns
efficient inference
large foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

online pruning
dynamic sparsity
context-conditioned pruning
structural pruning
efficient inference
👥 Authors

Yi Chen
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Wonjin Shin
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Shuhong Liu
The University of Tokyo
3DV, AI4S, Robotics

Tho Mai
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Jeongmo Lee
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Chuanbo Hua
Postdoctoral Researcher @ KAIST
Reinforcement Learning, Combinatorial Optimization, LLM for Algorithm Design

Kun Wang
CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences
Molecular Imaging, Radiomics

Jun Liu
Tokyo Institute of Technology, Tokyo, Japan

Joo-Young Kim
KAIST
Computer Architecture, AI Accelerator, System-on-Chip, FPGA