🤖 AI Summary
This work addresses the poorly understood mechanism behind continuous sparsification, one of the most effective approaches for compressing large neural networks to reduce inference cost and memory demands. By analyzing the learning dynamics, we theoretically characterize how the method's implicit regularization evolves: an initial $L_2$-biased phase transitions gradually into an $L_1$-sparse preference in later stages, which explains the experimental observation that the implicit $L_1$ regularization induced by jointly learning masks and weights outperforms explicit $L_1$ regularization. Building on this insight, we propose PILoT, a continuous sparsification method with novel initialization and dynamic regularization that actively steers the implicit-bias trajectory via a time-dependent Bregman potential. Within an extended mirror-flow framework, we establish convergence and optimality guarantees for underdetermined linear regression. Extensive experiments show that PILoT consistently surpasses baselines on standard benchmarks, achieving superior accuracy–sparsity trade-offs and validating the theory-driven design.
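One way to make the $L_2$-to-$L_1$ transition concrete: in the mirror-flow analysis of factorized linear models (e.g., weights written as a product of two learned factors, analogous to a mask–weight coupling), gradient flow on the factors is known to be equivalent to mirror flow on the effective weights under a hyperbolic-entropy potential parameterized by the initialization scale $\alpha$. The specific potential controlled by PILoT is not stated in this summary; the standard form from the literature is

$$\frac{d}{dt}\,\nabla \phi_\alpha\big(w(t)\big) = -\nabla L\big(w(t)\big), \qquad \phi_\alpha(w) = \sum_{i=1}^{d} \alpha^2\, q\!\left(\frac{w_i}{\alpha^2}\right), \quad q(z) = 2 - \sqrt{4+z^2} + z\,\operatorname{arcsinh}\!\left(\frac{z}{2}\right).$$

For $|w_i| \ll \alpha^2$ the potential behaves like a scaled $\|w\|_2^2$, while for $|w_i| \gg \alpha^2$ it grows like $\|w\|_1$ up to logarithmic factors. Letting the potential depend on time, as described above, turns this fixed interpolation into a controllable trajectory.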
📝 Abstract
Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit $L_1$ regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit $L_1$ regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit $L_2$ regularization that gradually transitions to an $L_1$ penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings may be of independent interest, as we demonstrate how to enter the rich regime and show that the implicit bias can be controlled via a time-dependent Bregman potential. To validate these insights, we introduce PILoT, a continuous sparsification approach with novel initialization and dynamic regularization, which consistently outperforms baselines in standard experiments.
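The implicit-bias claim in the underdetermined linear-regression setting can be illustrated with a small numerical sketch. The snippet below uses the classic $u \odot u - v \odot v$ overparameterization as a hedged stand-in for the mask–weight coupling (it is not PILoT itself, and the initialization scale `alpha` and all other constants are illustrative choices): with a small initialization scale, plain gradient descent on the factors should find an interpolant with markedly smaller $L_1$ norm than the minimum-$L_2$-norm pseudoinverse solution, despite no explicit penalty being applied.

```python
import numpy as np

# Underdetermined linear regression with a sparse ground truth.
rng = np.random.default_rng(0)
n, d = 10, 30                      # fewer measurements than unknowns
A = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[[3, 17]] = [2.0, -1.5]      # sparse target (signs of both kinds)
y = A @ w_true

# Factorized reparameterization w = u*u - v*v, a stand-in for the
# mask/weight coupling of continuous sparsification. A small init
# scale pushes the dynamics toward the sparsity-biased regime.
alpha = 1e-3
u = np.full(d, alpha)
v = np.full(d, alpha)
lr = 1e-3
for _ in range(50000):
    w = u * u - v * v              # effective weights
    g = A.T @ (A @ w - y)          # gradient of 0.5*||Aw - y||^2 w.r.t. w
    # Chain rule through the factors: dL/du = 2*g*u, dL/dv = -2*g*v.
    u, v = u - 2 * lr * g * u, v + 2 * lr * g * v

w_gd = u * u - v * v
loss = 0.5 * np.sum((A @ w_gd - y) ** 2)
w_pinv = np.linalg.pinv(A) @ y     # minimum-L2-norm interpolant
print(loss, np.linalg.norm(w_gd, 1), np.linalg.norm(w_pinv, 1))
```

Both `w_gd` and `w_pinv` fit the data, but the factorized dynamics land near the minimum-$L_1$-norm interpolant, so its $L_1$ norm comes out well below that of the dense pseudoinverse solution, without any explicit $L_1$ term in the objective.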