Dynamic Sparse Training of Diagonally Sparse Networks

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited hardware acceleration of unstructured sparsity in Dynamic Sparse Training (DST), this paper proposes DynaDiag, the first DST method to maintain a fixed diagonal sparsity structure throughout training. DynaDiag combines dynamic connection updates with a hardware-friendly diagonal sparsity constraint and introduces custom CUDA kernels for efficient sparse forward and backward passes. On Vision Transformers (ViTs) with 90% sparse linear layers, DynaDiag matches the accuracy of fully dense models while accelerating online inference by up to 3.13× and GPU training by 1.59×, substantially outperforming general-purpose sparse methods in speed while matching the accuracy of unstructured DST approaches. The work thereby unites high accuracy, computational efficiency, and hardware compatibility under extreme sparsity, three objectives that prior sparse training methods have not achieved simultaneously.

📝 Abstract
Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs at par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers. Our source code is available at https://github.com/horizon-research/DynaDiag/.
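To make the diagonal pattern concrete, the following is a minimal PyTorch sketch of a linear layer restricted to a fixed number of wrapped diagonals. The class name `DiagonalSparseLinear`, the `num_diagonals` parameter, and the per-diagonal vector storage are illustrative assumptions, not the paper's implementation; the paper computes this layer with a custom CUDA kernel rather than the Python loop shown here.

```python
import torch
import torch.nn as nn


class DiagonalSparseLinear(nn.Module):
    """d x d linear layer whose weight is restricted to `num_diagonals`
    wrapped diagonals: y[i] = sum_o v_o[i] * x[(i + o) % d] + b[i].
    Layer sparsity is 1 - num_diagonals / d."""

    def __init__(self, dim: int, num_diagonals: int):
        super().__init__()
        self.dim = dim
        # One trainable vector of length `dim` per active diagonal.
        self.values = nn.Parameter(torch.randn(num_diagonals, dim) * dim ** -0.5)
        # Offsets of the active diagonals; stored as a buffer so the
        # sparsity pattern can be rewired (pruned/regrown) during training.
        self.register_buffer("offsets", torch.randperm(dim)[:num_diagonals])
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim). A fused kernel would do this in one pass; here we
        # accumulate one shifted elementwise product per active diagonal.
        y = torch.zeros_like(x)
        for v, o in zip(self.values, self.offsets.tolist()):
            y = y + v * torch.roll(x, shifts=-o, dims=-1)
        return y + self.bias
```

As a usage example, `DiagonalSparseLinear(768, 77)` keeps roughly 10% of the weights of a 768×768 layer, i.e. about the 90% sparsity regime reported for the ViT linear layers above.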
Problem

Research questions and friction points this paper is trying to address.

Improving hardware efficiency of sparse neural networks
Enforcing diagonal sparsity for structured computation
Achieving speedups in training and inference without accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixed diagonal sparsity pattern for structured sparse-to-sparse DST (see the update sketch after this list)
Custom CUDA kernels for hardware-accelerated sparse forward and backward passes
Accuracy on par with unstructured DST combined with tangible training and inference speedups
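To illustrate how dynamic rewiring could coexist with the fixed diagonal structure, here is a hedged sketch of a prune-and-regrow step that operates on whole diagonals of the `DiagonalSparseLinear` layer sketched earlier. The function name `update_diagonals`, the `prune_frac` parameter, and the RigL-style gradient-magnitude growth criterion are assumptions for illustration, not the paper's actual update rule.

```python
import torch


@torch.no_grad()
def update_diagonals(layer, dense_grad: torch.Tensor, prune_frac: float = 0.3):
    """Hypothetical DST update at whole-diagonal granularity: drop the
    lowest-magnitude active diagonals and regrow the same number at the
    inactive offsets with the largest (dense) gradient magnitude, so the
    layer stays diagonal-sparse at a fixed sparsity level throughout."""
    k, dim = layer.values.shape
    n_swap = min(int(prune_frac * k), dim - k)  # cannot grow more than the inactive offsets
    if n_swap <= 0:
        return
    # Prune: active diagonals with the smallest L1 norm.
    drop = torch.topk(layer.values.abs().sum(dim=1), n_swap, largest=False).indices
    # Score every candidate offset o by the gradient mass along its diagonal,
    # i.e. sum_i |dense_grad[i, (i + o) % dim]|.
    rows = torch.arange(dim, device=dense_grad.device)
    cols = (rows.unsqueeze(1) + rows.unsqueeze(0)) % dim   # cols[i, o] = (i + o) % dim
    scores = dense_grad.abs().gather(1, cols).sum(dim=0)   # one score per offset
    scores[layer.offsets] = float("-inf")                  # exclude already-active offsets
    # Regrow: highest-scoring inactive offsets; new diagonals start at zero.
    grow = torch.topk(scores, n_swap).indices
    layer.offsets[drop] = grow
    layer.values[drop] = 0.0
```

Because pruning and regrowth act on whole diagonals, the parameters remain a fixed-shape dense (num_diagonals, dim) tensor at a constant sparsity level, which is what keeps the computation regular enough for an efficient GPU kernel.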
Abhishek Tyagi
Department of Computer Science, University of Rochester, Rochester, NY, USA
A. Iyer
The Institute of Optics, University of Rochester, Rochester, NY, USA
W. Renninger
The Institute of Optics, University of Rochester, Rochester, NY, USA
Christopher Kanan
University of Rochester
Yuhao Zhu
Department of Computer Science, University of Rochester, Rochester, NY, USA