🤖 AI Summary
This work addresses the challenge of precisely controlling model sparsity in Bregman-based optimizers, where conventional ℓ₁ regularization requires extensive hyperparameter tuning. We propose the first sparsity-feedback-driven mechanism for adapting λ within the Bregman optimization framework (LinBreg/AdaBreg): the regularization parameter is updated dynamically according to the deviation between the model's current sparsity and the target sparsity. This enables accurate sparsity control without laborious manual tuning. Evaluated on speaker verification with ECAPA-TDNN and ResNet34 on the VoxCeleb and CNCeleb datasets, the method consistently reaches target sparsity levels of 75%–99%, converges faster in early training, and matches or improves on the equal error rate of finely tuned non-adaptive baselines, while showing better robustness to out-of-distribution data than the dense baselines.
📝 Abstract
Sparse training reduces the memory and computational costs of deep neural networks. However, sparse optimization methods, e.g., those adding an $\ell_1$ penalty, often control sparsity only indirectly through a regularization parameter $\lambda$, whose mapping to the final sparsity rate is non-trivial. In our experiments, we found this parameter sensitivity to be particularly pronounced for Bregman-based optimizers. Specifically, the two variants LinBreg and AdaBreg reach the same sparsity at $\lambda$ values that differ by up to two orders of magnitude, requiring expensive trial-and-error sweeps to achieve a user-specified sparsity. To address this, we propose an adaptive regularization scheme that updates $\lambda$ based on the difference between the model's current sparsity and the target sparsity. We analyze the resulting algorithm and evaluate it on automatic speaker verification with ECAPA-TDNN and ResNet34 on VoxCeleb and CNCeleb. The proposed method reliably achieves sparsity targets ranging between 75% and 99%. It also converges faster than the oracle-tuned non-adaptive baseline during early training and matches or surpasses its final performance in equal error rate. We further show that the adaptive scheme inherits key properties from its non-adaptive counterpart, including improved out-of-distribution robustness over the dense baselines.
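The feedback idea in the abstract can be illustrated with a small sketch. The abstract does not state the actual update rule, so everything below is an assumption made for illustration: the multiplicative rule in `update_lambda`, its `gain` and clipping bounds, and the toy setup in `run_demo`, where fixed values are soft-thresholded as a stand-in for the sparsifying step of an optimizer. It is not the paper's algorithm, only one plausible way to steer $\lambda$ from the gap between current and target sparsity.

```python
import math


def sparsity(weights, tol=0.0):
    """Fraction of weights whose magnitude is at most `tol`."""
    zeros = sum(1 for w in weights if abs(w) <= tol)
    return zeros / len(weights)


def soft_threshold(v, lam):
    """Proximal map of lam * |.|: shrink v toward zero by lam."""
    return math.copysign(max(abs(v) - lam, 0.0), v)


def update_lambda(lam, current, target, gain=1.0, lam_min=1e-8, lam_max=1e3):
    """Hypothetical feedback rule (an assumption, not taken from the paper).

    If the model is denser than the target (current < target), lam grows;
    if it is sparser than the target, lam shrinks. The result is clipped
    to [lam_min, lam_max] to keep it in a sane range.
    """
    lam = lam * math.exp(gain * (target - current))
    return min(max(lam, lam_min), lam_max)


def run_demo(target=0.9, lam=0.01, steps=50):
    """Toy loop: fixed pre-threshold values, only lam is adapted.

    In real training the weights would also change between updates;
    they are frozen here so the effect of the feedback rule is isolated.
    """
    values = [(i + 1) / 1000 for i in range(1000)]  # toy values in (0, 1]
    current = 0.0
    for _ in range(steps):
        weights = [soft_threshold(v, lam) for v in values]
        current = sparsity(weights)
        lam = update_lambda(lam, current, target)
    return lam, current


if __name__ == "__main__":
    final_lam, final_sparsity = run_demo()
    print(f"lambda = {final_lam:.4f}, sparsity = {final_sparsity:.3f}")
```

A multiplicative update is used here because, according to the abstract, useful $\lambda$ values can differ by orders of magnitude between LinBreg and AdaBreg; scaling $\lambda$ rather than adding to it lets the same rule move across that range from any starting value. Whether the paper makes the same choice is not stated in the abstract.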