An Efficient Training Algorithm for Models with Block-wise Sparsity

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low training efficiency and high computation and memory overhead of block-sparse models, this paper proposes an end-to-end differentiable training framework that abandons the conventional "dense-then-prune" paradigm, instead initializing directly from a sparse structure and dynamically optimizing block sizes during training. The method introduces three key components: gradient updates under structured sparsity constraints, block-level adaptive mask learning, and sparse-dense hybrid forward/backward propagation, which together ensure hardware compatibility and training stability. Experiments across multiple benchmarks demonstrate 40–65% reductions in computation and memory usage while matching the accuracy of dense baselines; moreover, the framework enables automatic block-size search. To our knowledge, this is the first structured sparse training approach that jointly optimizes block dimensions with model parameters during end-to-end training.
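The block-level mask learning mentioned above can be sketched in a few lines. This is a minimal NumPy illustration only: the function name `block_mask` and the top-k-by-Frobenius-norm selection rule are assumptions for the sketch, not the paper's actual scoring criterion.

```python
import numpy as np

def block_mask(W, block, keep):
    """Keep the `keep` (block x block) tiles of W with the largest
    Frobenius norm and zero the rest (illustrative top-k selection)."""
    rows, cols = W.shape
    nb_r, nb_c = rows // block, cols // block
    # View W as a grid of tiles and score each tile by its energy.
    tiles = W.reshape(nb_r, block, nb_c, block)
    scores = (tiles ** 2).sum(axis=(1, 3))  # (nb_r, nb_c)
    # Build a 0/1 mask that keeps only the top-`keep` tiles.
    flat = scores.ravel()
    top = np.argsort(flat)[-keep:]
    mask = np.zeros_like(flat)
    mask[top] = 1.0
    mask = mask.reshape(nb_r, 1, nb_c, 1)
    return (tiles * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W_sparse = block_mask(W, block=4, keep=2)  # 2 of 4 blocks survive
```

In a full training loop the surviving blocks would be the only ones that receive gradient updates, which is where the compute and memory savings come from.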

📝 Abstract
Large-scale machine learning (ML) models are increasingly being used in critical domains such as education, lending, recruitment, healthcare, and criminal justice. However, the training, deployment, and utilization of these models demand substantial computational resources. To decrease computation and memory costs, machine learning models with sparse weight matrices are widely used in the literature. Among sparse models, those with special sparse structures (e.g., models with block-wise sparse weight matrices) fit better with hardware accelerators and can decrease memory and computation costs during inference. Unfortunately, while there are several efficient training methods, none of them are designed to train a block-wise sparse model efficiently. As a result, current methods for training block-wise sparse models start with full, dense models, leading to inefficient training. In this work, we focus on training models with *block-wise sparse matrices* and propose an efficient training algorithm to decrease both computation and memory costs during training and inference. In addition, we show that our proposed method enables us to efficiently find the right block size for the sparsity pattern during the training process. Our extensive empirical and theoretical analyses show that our algorithms can decrease computation and memory costs significantly without a performance drop compared to baselines.
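The inference-time saving the abstract refers to comes from touching only the stored blocks during a matrix-vector product. A minimal sketch, assuming a simple dictionary-of-tiles storage format (the function name and format are illustrative, not the paper's data structure):

```python
import numpy as np

def block_sparse_matvec(blocks, block, n_rows, x):
    """Multiply a block-sparse matrix by x, touching only stored tiles.
    `blocks` maps (block-row, block-col) -> dense (block x block) tile."""
    y = np.zeros(n_rows)
    for (bi, bj), tile in blocks.items():
        y[bi*block:(bi+1)*block] += tile @ x[bj*block:(bj+1)*block]
    return y

rng = np.random.default_rng(1)
block = 4
# Store only 2 of the 4 blocks of an 8x8 matrix: 50% of the
# dense FLOPs and memory are skipped entirely.
blocks = {(0, 0): rng.standard_normal((block, block)),
          (1, 1): rng.standard_normal((block, block))}
x = rng.standard_normal(8)
y = block_sparse_matvec(blocks, block, 8, x)
```

Because each tile is a contiguous dense sub-matrix, the inner `tile @ x[...]` call maps directly onto the dense kernels that hardware accelerators are optimized for, which is why block-wise sparsity is more hardware-friendly than unstructured sparsity.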
Problem

Research questions and friction points this paper is trying to address.

Efficient training for block-wise sparse ML models
Reducing computation and memory costs in training
Optimizing block size for sparsity during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient training algorithm for block-wise sparsity
Optimizes computation and memory costs
Dynamically finds optimal block sparsity pattern
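The block-size search listed above involves a real trade-off: smaller blocks preserve more of the weight matrix's important entries at a fixed sparsity level, while larger blocks map better onto accelerator kernels. A toy NumPy sketch of scoring candidate block sizes by retained energy; the `retained_energy` criterion is an illustrative assumption, not the paper's search objective:

```python
import numpy as np

def retained_energy(W, block, density):
    """Fraction of squared Frobenius norm kept when retaining the
    top `density` fraction of (block x block) tiles (illustrative)."""
    nb_r, nb_c = W.shape[0] // block, W.shape[1] // block
    tiles = W.reshape(nb_r, block, nb_c, block)
    scores = (tiles ** 2).sum(axis=(1, 3)).ravel()
    keep = max(1, int(round(density * scores.size)))
    return np.sort(scores)[-keep:].sum() / scores.sum()

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 16))
# Compare candidate block sizes at 25% block density.
best = max([1, 2, 4, 8], key=lambda b: retained_energy(W, b, 0.25))
```

On an unstructured random matrix the finest block size always wins this score, which is exactly why a practical search (like the one the paper automates) must also weigh hardware efficiency against retained accuracy rather than optimize either alone.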