Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Structured sparsity regularization is often incompatible with standard SGD due to its non-differentiability, necessitating specialized optimizers or post-hoc pruning, both lacking theoretical guarantees. To address this, we propose $D$-Gating: a differentiable structured over-parameterization that decomposes each weight group into a primary weight vector and $D-1$ scalar gating factors, enabling end-to-end optimization. We establish theoretically that every local minimum under $D$-Gating corresponds to a local minimum of the non-smooth $L_{2,2/D}$ group-sparse penalty, and that in the gradient flow limit the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss. The induced learning dynamics evolve from a non-sparse regime into sparse optimization. Crucially, $D$-Gating is fully compatible with vanilla SGD and requires no additional pruning. Empirically, it outperforms direct optimization of structured penalties and conventional pruning baselines across vision, language, and tabular tasks, achieving strong trade-offs between sparsity and accuracy.

📝 Abstract
Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.
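The decomposition described in the abstract can be illustrated numerically: each weight group $w$ is split into a primary vector $v$ and $D-1$ scalar gates, and at the balanced (penalty-minimizing) factorization a plain $L_2$ penalty on all factors equals $D\,\|w\|_2^{2/D}$, i.e. the non-smooth $L_{2,2/D}$ group penalty up to a constant factor. The sketch below is illustrative only; the function names and the explicit balanced factorization are our own constructions, not the paper's code.

```python
import numpy as np

def d_gating_penalty(v, gates):
    # Smooth L2 penalty on all factors of the overparameterization
    # w = v * prod(gates): sum ||v||^2 + sum_k g_k^2.
    return np.sum(v ** 2) + np.sum(np.asarray(gates) ** 2)

def balanced_factorization(w, D):
    # Split w into a primary vector v and D-1 scalar gates with
    # v * prod(gates) == w and all factor norms equal to ||w||^{1/D};
    # by the AM-GM inequality this balance minimizes the smooth penalty.
    norm = np.linalg.norm(w)
    scale = norm ** (1.0 / D)
    v = w / norm * scale            # ||v|| = ||w||^{1/D}
    gates = [scale] * (D - 1)       # each gate equals ||w||^{1/D}
    return v, gates

w = np.array([3.0, 4.0])            # one weight group, ||w||_2 = 5
D = 3
v, gates = balanced_factorization(w, D)

# The factors reproduce the original group weights exactly.
assert np.allclose(v * np.prod(gates), w)

# At the balanced point, the smooth penalty equals D * ||w||_2^{2/D},
# matching the non-smooth L_{2,2/D} group penalty up to the factor D.
assert np.isclose(d_gating_penalty(v, gates), D * np.linalg.norm(w) ** (2.0 / D))
```

Because `2/D < 1` for `D >= 3`, the induced group penalty is non-convex and sparsity-promoting, yet each factor is penalized by an ordinary differentiable $L_2$ term, which is what keeps the reparameterized objective compatible with vanilla SGD.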
Problem

Research questions and friction points this paper is trying to address.

Non-differentiable structured sparsity penalties break compatibility with standard SGD
Specialized optimizers and post-hoc pruning operate without formal guarantees
Existing approaches struggle to balance sparsity rate against predictive accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable structured overparameterization via D-Gating
Proves equivalence to non-smooth structured $L_{2,2/D}$ penalization, with at least exponential convergence in the gradient flow limit
Fully compatible with vanilla SGD, requiring no post-hoc pruning