Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries

📅 2025-02-04
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
The non-differentiability of the $L_1$ norm hinders its integration with stochastic gradient descent (SGD) for sparsity regularization in neural networks. Method: We propose Deep Weight Factorization (DWF), a framework that factorizes each weight into three or more differentiable factors, enabling smooth sparse optimization via differentiable $L_2$ regularization on the factors, which acts as a non-convex sparsity penalty on the collapsed weights. Contribution/Results: We establish, for the first time, the theoretical equivalence between DWF and non-convex sparse regularization. To overcome the expressivity and optimization-instability limitations of shallow factorizations, we introduce a symmetry-driven parameterization, a theoretically grounded initialization, and learning-rate constraints. Experiments across diverse architectures and datasets show that DWF consistently outperforms shallow factorization and state-of-the-art pruning methods, particularly at high sparsity levels, achieving superior accuracy–sparsity trade-offs.
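The factorized parameterization is straightforward to express in code. Below is a minimal PyTorch sketch (our own illustration, not the authors' implementation; `FactorizedLinear` and `factor_penalty` are names we introduce here): a linear layer whose weight is the elementwise product of `depth` factors, with a plain $L_2$ penalty on the factors serving as the differentiable sparsity surrogate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedLinear(nn.Module):
    """Linear layer with weight w = w_1 * w_2 * ... * w_D (elementwise).

    Ordinary, differentiable L2 weight decay on the factors then acts as a
    non-convex sparsity penalty on the collapsed weight w.
    """

    def __init__(self, in_features: int, out_features: int, depth: int = 3):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.empty(out_features, in_features)) for _ in range(depth)]
        )
        for f in self.factors:
            nn.init.normal_(f, std=0.1)  # placeholder; the paper uses a tailored scheme
        self.bias = nn.Parameter(torch.zeros(out_features))

    def collapsed_weight(self) -> torch.Tensor:
        w = self.factors[0]
        for f in self.factors[1:]:
            w = w * f  # Hadamard product of the factors
        return w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.collapsed_weight(), self.bias)


def factor_penalty(layer: FactorizedLinear, lam: float = 1e-4) -> torch.Tensor:
    """Smooth surrogate penalty: sum of squared L2 norms of all factors."""
    return lam * sum(f.pow(2).sum() for f in layer.factors)
```

In use, one adds `factor_penalty(layer)` to the task loss; after training, the collapsed weights cluster near zero and can be thresholded to read off the sparse network.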

📝 Abstract
Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the $L_1$ norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of $L_1$-penalized neural networks by adding differentiable $L_2$ regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.
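For intuition on the claimed equivalence, the standard variational identity behind such factorizations (our gloss via the AM–GM inequality; the paper's exact statement may differ) reads: in the shallow case $D = 2$,

$$|w| \;=\; \min_{uv = w} \tfrac{1}{2}\left(u^2 + v^2\right),$$

and for a depth-$D$ factorization $w = w_1 \cdots w_D$,

$$|w|^{2/D} \;=\; \min_{w_1 \cdots w_D = w} \tfrac{1}{D} \sum_{d=1}^{D} w_d^2,$$

with the minimum attained when all $|w_d| = |w|^{1/D}$. Minimizing the smooth $L_2$ penalty on the factors is therefore equivalent to a non-convex $L_{2/D}$ quasi-norm penalty on the collapsed weights, which becomes increasingly sparsity-inducing as $D$ grows.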
Problem

Research questions and friction points this paper is trying to address.

The $L_1$ penalty is non-differentiable and therefore incompatible with SGD-based training
Shallow two-factor factorizations limit expressivity and can destabilize optimization
Standard initialization and learning-rate choices break down for factorized networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep weight factorization into more than two factors
Smooth $L_2$ regularization on the factors as a differentiable surrogate for sparsity penalties
Tailored initialization scheme and learning-rate requirements (sketched below)
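The tailored initialization is only named above; neither the abstract nor the summary spells it out. One natural realization, stated purely as an assumption and not necessarily the paper's scheme, is to draw a target weight from a standard initializer and split its magnitude evenly across factors so that their product reproduces it exactly:

```python
import torch


def init_factors_from_target(shape, depth: int = 3, std: float = 0.05):
    """Hypothetical sketch (an assumption, not the paper's scheme): sample a
    target collapsed weight w with a standard init, then give every factor
    magnitude |w|**(1/depth) so the elementwise product of the factors is w.
    """
    w = torch.randn(shape) * std          # target collapsed weight
    mag = w.abs().pow(1.0 / depth)        # equal magnitude split across factors
    factors = [mag.clone() for _ in range(depth)]
    factors[0] = factors[0] * torch.sign(w)  # carry the sign on one factor
    return factors
```

Splitting magnitudes evenly matches the minimizer of the factor-$L_2$ penalty for a fixed product, keeping the penalty and the forward pass consistent at initialization.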
Chris Kolb
LMU Munich; Munich Center for Machine Learning (MCML)
Sparsity in DL · DNN Theory/Generalization · Statistical Modeling with DL · Multimodal Learning
Tobias Weber
Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML), Munich
Bernd Bischl
Chair of Statistical Learning and Data Science, LMU Munich
Machine Learning · Statistics · Data Science · Statistical Learning · Scientific Software
David Rügamer
Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML), Munich