PrunePath: Towards Highly Structured Sparse Language Models

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing pruning methods struggle to translate sparsity in feed-forward networks (FFNs) into hardware-friendly inference acceleration. This work proposes PrunePath, a budget-adaptive structured sparsification framework for FFN layers built on MoEfication. Instead of thresholding each expert independently, PrunePath computes a softmax-normalized routing distribution and activates experts under a cumulative-mass threshold. The resulting token-level probability budget lets the number of active experts vary per token and exposes an inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, the authors report a favorable sparsity–performance trade-off compared with existing static pruning and MoEfication-based methods. Custom Triton kernels for KV-cache decoding turn the structured sparsity into memory savings and faster decoding.

📝 Abstract

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity–performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
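
The cumulative-mass rule described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each token has one router score per expert, and that the active set is the smallest group of highest-probability experts whose softmax mass reaches the budget, which is similar in spirit to nucleus (top-p) selection. The names `select_experts`, `scores`, and `budget` are hypothetical and do not come from the paper.

```python
import math


def select_experts(scores, budget):
    """Pick the fewest experts whose softmax mass reaches `budget`.

    Hypothetical illustration only: `scores` are per-token router scores,
    one per expert, and `budget` is the cumulative-mass threshold in (0, 1].
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")

    # Numerically stable softmax over the router scores for one token.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit experts from most to least probable; stop once the budget is met.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= budget:
            break
    return chosen, mass


# A token with one dominant expert needs a single expert at budget 0.9,
# while a token with a flat routing distribution needs all four.
peaked, _ = select_experts([4.0, 0.0, 0.0, 0.0], budget=0.9)
flat, _ = select_experts([0.0, 0.0, 0.0, 0.0], budget=0.9)
print(len(peaked), len(flat))  # prints "1 4"
```

Under these assumptions, lowering `budget` at inference time reduces the number of active experts without retraining, which is one way to read the "sparsity knob from a single checkpoint" claim.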

Problem

Research questions and friction points this paper is trying to address.

structured sparsity

language models

inference efficiency

feed-forward networks

model pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured sparsification

MoEfication

adaptive expert routing