POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models

📅 2026-02-06
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses a limitation of existing structured pruning methods: they rely on static pruning strategies and fail to adapt to the dynamically varying sparsity patterns inherent in autoregressive generation, compromising both inference efficiency and accuracy. To overcome this, the authors propose POP, a lightweight online structured pruning framework that operates without preprocessing, retraining, or auxiliary predictors. POP constructs a coarse-grained pruning partition during the prefill phase and applies fine-grained masks to candidate regions during decoding, yielding a context-aware dynamic sparsity mechanism. By partitioning channels into retained, candidate, and pruned regions, POP balances model accuracy against computational cost. Experiments show that POP consistently outperforms state-of-the-art pruning approaches across diverse large models, including LLMs, MoEs, and VLMs, achieving superior accuracy with lower computational overhead and reduced latency.
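
As a rough illustration of the partition-then-mask idea described above, the following Python sketch shows how a coarse three-way channel partition could be derived from prefill activations. This is not the paper's implementation: the function name, the mean-absolute-activation importance score, and the region fractions are all assumptions made for exposition.

```python
import torch

def coarse_partition(prefill_acts: torch.Tensor,
                     retain_frac: float = 0.5,
                     prune_frac: float = 0.2):
    """Split channels into retained / candidate / pruned index sets.

    prefill_acts: (num_tokens, num_channels) activations collected
    during the prefill phase. Importance is scored here as the mean
    absolute activation per channel (an assumption for illustration;
    the paper's actual scoring criterion may differ).
    """
    importance = prefill_acts.abs().mean(dim=0)          # (num_channels,)
    order = torch.argsort(importance, descending=True)   # high to low
    n = importance.numel()
    n_retain = int(retain_frac * n)
    n_prune = int(prune_frac * n)
    retained = order[:n_retain]               # consistently important: always kept
    pruned = order[n - n_prune:]              # consistently unimportant: always dropped
    candidate = order[n_retain:n - n_prune]   # re-scored per token during decoding
    return retained, candidate, pruned
```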

📝 Abstract
Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions before inference, overlooking sparsity patterns that emerge during autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing such as offline calibration, retraining, or learned predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring lower computational overhead and lower inference latency.
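
To complement the prefill-time sketch above, here is a hypothetical decode-step counterpart showing how a fine-grained mask might be drawn from the candidate region for each generated token, avoiding re-evaluation of the full channel set. Again, the per-token absolute-activation score and the keep fraction are illustrative assumptions, not the paper's criterion.

```python
import torch

def decode_step_mask(hidden: torch.Tensor,
                     retained: torch.Tensor,
                     candidate: torch.Tensor,
                     keep_frac: float = 0.5) -> torch.Tensor:
    """Build a boolean channel mask for one decoding step.

    hidden: (num_channels,) hidden state of the current token.
    Retained channels always pass; only the top-scoring fraction of
    the candidate region is switched on for this token; pruned
    channels (everything outside retained + candidate) stay off.
    """
    mask = torch.zeros_like(hidden, dtype=torch.bool)
    mask[retained] = True                           # coarse partition: fixed
    scores = hidden[candidate].abs()                # context-conditioned score
    k = max(1, int(keep_frac * candidate.numel()))
    top = torch.topk(scores, k).indices
    mask[candidate[top]] = True                     # fine-grained, per-token
    return mask
```

In this sketch, only the candidate region is re-scored at each step, so the per-token overhead scales with the candidate size rather than the full channel count, which is the efficiency argument the abstract makes for avoiding full-channel re-evaluation.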
Problem

Research questions and friction points this paper is trying to address.

structural pruning
autoregressive generation
sparsity patterns
efficient inference
large foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

online pruning
dynamic sparsity
context-conditioned pruning
structural pruning
efficient inference
👥 Authors

Yi Chen
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Wonjin Shin
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Shuhong Liu
The University of Tokyo
3DV, AI4S, Robotics

Tho Mai
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Jeongmo Lee
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Chuanbo Hua
Postdoctoral Researcher @ KAIST
Reinforcement Learning, Combinatorial Optimization, LLM for Algorithm Design

Kun Wang
CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences
Molecular Imaging, Radiomics

Jun Liu
Tokyo Institute of Technology, Tokyo, Japan

Joo-Young Kim
KAIST
Computer Architecture, AI Accelerator, System-on-Chip, FPGA