🤖 AI Summary
To address critical challenges in dynamic sparse training, including inconsistent pruning criteria, difficulty in adapting to structured pruning, and myopic growth strategies, this paper proposes a novel sparsification paradigm that decouples the “active structure” from the “exploration space.” It unifies the importance evaluation used for dynamic weight- and channel-level pruning and growth, and introduces a two-phase exploitation-exploration mechanism: before each exploration step, exploration-space parameters are briefly pre-trained (with the active part frozen) to improve reintegration quality, combined with Top-k significance recalibration, parameter freezing/unfreezing, and ERK-based sparsity allocation. On ImageNet, ResNet-50 achieves a 1.3% Top-1 accuracy gain over prior art at 90% ERK sparsity, while reducing training cost by over 70% compared to HALP. The method simultaneously improves compression ratio, inference speed, and accuracy.
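For context on the “ERK-based sparsity allocation” mentioned above: ERK (Erdős–Rényi-Kernel), popularized by RigL, spreads a global sparsity budget non-uniformly across layers. Assuming the standard definition (the summary does not spell it out), the density of convolutional layer $l$ is set proportional to

$$ d^{l} \propto \frac{n^{l-1} + n^{l} + w^{l} + h^{l}}{n^{l-1}\, n^{l}\, w^{l}\, h^{l}}, $$

where $n^{l-1}$ and $n^{l}$ are the input and output channel counts and $w^{l}$, $h^{l}$ the kernel dimensions, with a global scaling factor chosen so that the total parameter count matches the target sparsity (90% here). Wider layers thus end up sparser, while small layers retain relatively more weights.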
📝 Abstract
Pruning aims to accelerate and compress models by removing redundant parameters, identified by specially designed importance scores that are usually imperfect. This removal is irreversible, often leading to subpar performance in pruned models. Dynamic sparse training, while attempting to adjust sparse structures during training for continual reassessment and refinement, has several limitations, including criterion inconsistency between pruning and growth, unsuitability for structured sparsity, and short-sighted growth strategies. Our paper introduces an efficient, innovative paradigm to enhance a given importance criterion for either unstructured or structured sparsity. Our method separates the model into an active structure for exploitation and an exploration space for potential updates. During exploitation, we optimize the active structure, whereas in exploration, we reevaluate and reintegrate parameters from the exploration space through a pruning and growing step consistently guided by the same given importance criterion. To prepare for exploration, we briefly “reactivate” all parameters in the exploration space and train them for a few iterations while keeping the active part frozen, offering a preview of the potential performance gains from reintegrating these parameters. We show across various datasets and configurations that an existing importance criterion, even one as simple as magnitude, can be enhanced with our paradigm to achieve state-of-the-art performance and training cost reductions. Notably, on ImageNet with ResNet-50, ours achieves a 1.3% increase in Top-1 accuracy over prior art at 90% ERK sparsity. Compared with HALP, the state-of-the-art latency pruning method, ours reduces training cost by over 70% while producing a faster and more accurate pruned model.
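To make the exploit/explore cycle described in the abstract concrete, here is a minimal sketch on a single linear layer with an unstructured magnitude criterion. It is not the authors' implementation: the class and method names (`ExploitExploreSketch`, `exploration_preview`, etc.), the single-layer setup, and the plain SGD updates are illustrative assumptions only; the paper additionally covers structured (channel-level) sparsity and other importance criteria.

```python
import torch
import torch.nn as nn

def magnitude_topk_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the top (1 - sparsity) fraction of entries by |magnitude|."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = torch.topk(weight.abs().flatten(), k).values.min()
    return (weight.abs() >= threshold).float()

class ExploitExploreSketch:
    """Toy exploit/explore cycle on one linear layer (illustrative only).

    Exploitation: train only the active (unmasked) weights.
    Exploration preview: briefly reactivate the pruned weights and train only
    them, with the active part frozen, then prune-and-grow by re-applying the
    same magnitude criterion to all weights.
    """

    def __init__(self, layer: nn.Linear, sparsity: float = 0.9):
        self.layer = layer
        self.sparsity = sparsity
        self.mask = magnitude_topk_mask(layer.weight.data, sparsity)
        self.layer.weight.data.mul_(self.mask)  # zero out the exploration space

    def exploitation_step(self, x, y, loss_fn, lr=0.1):
        # Gradient step restricted to the active structure.
        loss = loss_fn(self.layer(x), y)
        grad = torch.autograd.grad(loss, self.layer.weight)[0]
        self.layer.weight.data -= lr * grad * self.mask
        return loss.item()

    def exploration_preview(self, x, y, loss_fn, lr=0.1, iters=5):
        # Briefly "reactivate" exploration-space weights and update only them
        # (active part frozen) to preview their potential contribution.
        inactive = 1.0 - self.mask
        for _ in range(iters):
            loss = loss_fn(self.layer(x), y)
            grad = torch.autograd.grad(loss, self.layer.weight)[0]
            self.layer.weight.data -= lr * grad * inactive

    def prune_and_grow(self):
        # Re-select the active set with the same magnitude criterion, now that
        # exploration-space weights carry informative (previewed) values.
        self.mask = magnitude_topk_mask(self.layer.weight.data, self.sparsity)
        self.layer.weight.data.mul_(self.mask)

# Usage sketch: alternate exploitation with periodic exploration + prune/grow.
layer = nn.Linear(64, 10)
trainer = ExploitExploreSketch(layer, sparsity=0.9)
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    trainer.exploitation_step(x, y, loss_fn)
    if (step + 1) % 20 == 0:  # exploration phase every 20 steps
        trainer.exploration_preview(x, y, loss_fn)
        trainer.prune_and_grow()
```

The key point the sketch tries to capture is that pruning and growth are driven by one and the same criterion (magnitude here), applied after the exploration space has been briefly trained, rather than growing weights by a separate heuristic such as gradient magnitude.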