Deterministic Differentiable Structured Pruning for Large Language Models

πŸ“… 2026-03-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenges posed by the discrete nature of the ℓ₀ norm in structured pruning of large language models, which leads to train–test inconsistency and slow convergence. To overcome these issues, the authors propose Deterministic Differentiable Pruning (DDP), a method that optimizes a deterministic soft approximation of the ℓ₀ objective, eliminating the need for stochastic hard-concrete relaxation and enabling end-to-end mask learning without random sampling. DDP enhances mask expressiveness, reduces the gap between training and deployment, and significantly accelerates convergence. Experiments on models such as Qwen3-32B show a performance drop as small as 1% on downstream tasks while outperforming existing pruning approaches at 20% sparsity, with substantial end-to-end inference speedups.

πŸ“ Abstract
Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an ℓ₀ sparsity constraint. Due to the discreteness of the ℓ₀ norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete ℓ₀ objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
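The abstract's central idea, replacing a stochastic hard-concrete mask with a deterministic soft surrogate of the ℓ₀ objective, can be illustrated with a minimal sketch. This is not the paper's implementation; the sigmoid gate, the sharpness parameter `beta`, and the 0.5 discretization threshold are illustrative assumptions chosen to show why a deterministic gate avoids a train-test gap.

```python
import math

def soft_gate(theta, beta=4.0):
    # Deterministic gate in (0, 1): a differentiable stand-in for a binary
    # keep/drop mask, with no random sampling involved.
    return 1.0 / (1.0 + math.exp(-beta * theta))

def soft_l0(thetas, beta=4.0):
    # Smooth surrogate of the l0 norm: the sum of gate values approximates
    # the count of retained components, and is differentiable in theta.
    return sum(soft_gate(t, beta) for t in thetas)

# One learnable logit per prunable component (values are illustrative).
thetas = [2.0, -3.0, 0.1, -0.05]
gates = [soft_gate(t) for t in thetas]

# Deployment: threshold the same deterministic gates. Because training never
# sampled, the discretized mask matches what the optimizer actually saw.
mask = [1 if g >= 0.5 else 0 for g in gates]
```

With a stochastic relaxation, the mask seen at each training step is a random draw, so the fixed mask used at deployment can differ from what was optimized; here the gate is a pure function of `theta`, so discretization is the only source of mismatch.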
Problem

Research questions and friction points this paper is trying to address.

structured pruning
large language models
l0 sparsity
train-test mismatch
deterministic optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deterministic Differentiable Pruning
structured pruning
l0 sparsity
train-test mismatch
large language models
πŸ”Ž Similar Papers
No similar papers found.