PrunePath: Towards Highly Structured Sparse Language Models

📅 2026-05-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the difficulty existing pruning methods have in translating sparsity in feed-forward networks (FFNs) into hardware-friendly inference acceleration. It proposes PrunePath, a budget-adaptive structured sparsification framework for FFN layers, built on MoEfication. PrunePath activates experts using a softmax-normalized routing distribution under a cumulative-mass threshold. Its key innovation is a token-level probability budget that lets the number of active experts vary per token and exposes an inference-time sparsity knob from a single checkpoint. Combined with custom Triton kernels for KV-cache decoding, PrunePath reports a favorable sparsity-performance trade-off across NLU, NLG, and instruction-tuning evaluations, along with memory savings and faster decoding.
📝 Abstract
Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
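The cumulative-mass selection described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name `select_experts`, the `budget` parameter, and the toy router scores are hypothetical, and the sketch assumes the rule is "keep the most probable experts until their softmax mass reaches the budget" for a single token.

```python
import math


def select_experts(router_logits, budget):
    """Pick the smallest set of top experts whose softmax mass reaches `budget`.

    Hypothetical helper for illustration; `router_logits` holds one token's
    per-expert scores and `budget` is the probability mass to cover.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")

    # Numerically stable softmax over the expert scores.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit experts from most to least probable, accumulating mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= budget:
            break
    return chosen, probs


if __name__ == "__main__":
    confident_token = [4.0, 0.0, 0.0, 0.0]  # router strongly prefers expert 0
    uncertain_token = [0.0, 0.0, 0.0, 0.0]  # router has no preference

    for budget in (0.5, 0.9):
        n_confident = len(select_experts(confident_token, budget)[0])
        n_uncertain = len(select_experts(uncertain_token, budget)[0])
        print(budget, n_confident, n_uncertain)
```

In this toy setting, a token with a peaked routing distribution activates fewer experts than a token with a flat one at the same budget, and raising the budget at inference time activates more experts without changing any weights. That is the sense in which a single probability budget can act as a sparsity knob.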
Problem

Research questions and friction points this paper is trying to address.

structured sparsity
language models
inference efficiency
feed-forward networks
model pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured sparsification
MoEfication
adaptive expert routing
inference efficiency
Triton kernels