🤖 AI Summary
To address the low training efficiency and high computational/memory overhead of block-sparse models, this paper proposes an end-to-end differentiable training framework that abandons the conventional “dense-then-prune” paradigm, instead initializing directly from a sparse structure and dynamically optimizing block sizes during training. The method introduces three key components: gradient updates under structured sparsity constraints, block-level adaptive mask learning, and sparse-dense hybrid forward/backward propagation, which together ensure both hardware compatibility and training stability. Experiments across multiple benchmarks demonstrate 40–65% reductions in computation and memory usage while matching the accuracy of dense baselines; moreover, the framework enables automatic block-size search. To our knowledge, this is the first structured sparse training approach that jointly optimizes block dimensions with model parameters during end-to-end training.
📝 Abstract
Large-scale machine learning (ML) models are increasingly used in critical domains such as education, lending, recruitment, healthcare, and criminal justice. However, the training, deployment, and utilization of these models demand substantial computational resources. To decrease computation and memory costs, machine learning models with sparse weight matrices are widely used in the literature. Among sparse models, those with special sparse structures (e.g., models with block-wise sparse weight matrices) are better suited to hardware accelerators and can decrease memory and computation costs during inference. Unfortunately, while several efficient training methods exist, none of them is designed to train block-wise sparse models efficiently. As a result, current methods for training block-wise sparse models start from full, dense models, leading to inefficient training. In this work, we focus on training models with *block-wise sparse matrices* and propose an efficient training algorithm that decreases both computation and memory costs during training and inference. In addition, we show that our proposed method enables us to efficiently find the right block size for the sparsity pattern during the training process. Our extensive empirical and theoretical analyses show that our algorithms can decrease computation and memory costs significantly without a performance drop compared to baselines.
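To make the block-wise sparsity pattern concrete, here is a minimal NumPy sketch of what a block-wise sparse weight matrix looks like: a single keep/drop decision is made per block (tile) rather than per individual weight, so the surviving nonzeros form dense tiles that map well onto hardware accelerators. This is purely an illustration; the `block_sparse_mask` helper and the random keep/drop choice are our own assumptions, whereas the paper learns which blocks to keep during training.

```python
import numpy as np

def block_sparse_mask(shape, block_size, keep_fraction, rng):
    """Binary mask that zeroes out whole block_size x block_size tiles.

    Hypothetical illustration: blocks are kept at random here; the
    paper's method instead learns the block mask during training.
    """
    rows, cols = shape
    br, bc = rows // block_size, cols // block_size
    # One keep/drop decision per block, not per weight.
    block_decisions = rng.random((br, bc)) < keep_fraction
    # Expand each block decision to cover its full tile.
    return np.kron(block_decisions, np.ones((block_size, block_size)))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
mask = block_sparse_mask(W.shape, block_size=4, keep_fraction=0.5, rng=rng)
W_sparse = W * mask  # each 4x4 tile is either fully kept or fully zero
```

Because entire tiles are zero, the sparse matrix can be stored and multiplied block by block (e.g., in a block-compressed format), which is what makes this structure cheaper than unstructured sparsity on real hardware.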