AI Summary
Structured sparse training enables GPU acceleration but suffers from limited expressivity due to fixed sparsity patterns (e.g., block, N:M, diagonal), resulting in substantial accuracy degradation compared to unstructured dynamic sparse training (DST). To address this, we propose Permutation-Augmented Dynamic Sparse Training (PA-DST), which jointly learns, per layer, a differentiable and optimizable permutation matrix alongside structured sparse weights, enabling dynamic rearrangement of weight locations to significantly enhance representational capacity. PA-DST is compatible with mainstream structured sparsity patterns and maintains high sparsity (90-95%). On ImageNet-1K and WikiText-103, it matches the accuracy of state-of-the-art unstructured DST methods (e.g., RigL, SET), while accelerating training by 1.21× and inference by up to 2.9×.
Abstract
Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonal -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.
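The core mechanism can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the `nm_mask` helper, the 1:4 pattern, and the random permutation standing in for the learned matrix $P$ are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def nm_mask(shape, n=1, m=4):
    """Random N:M mask: keep n entries in every group of m consecutive columns, per row."""
    mask = np.zeros(shape)
    for r in range(shape[0]):
        for c in range(0, shape[1], m):
            keep = rng.choice(m, size=n, replace=False)  # which n of the m slots survive
            mask[r, c + keep] = 1.0
    return mask

out_f, in_f = 8, 16
mask = nm_mask((out_f, in_f))                # 1:4 pattern (75% sparse) for illustration
W = rng.normal(size=(out_f, in_f)) * mask    # structured sparse weight matrix

perm = rng.permutation(in_f)                 # stand-in for the learned permutation
P = np.eye(in_f)[perm]                       # permutation matrix: (P @ x)[i] = x[perm[i]]

x = rng.normal(size=in_f)
y = W @ (P @ x)                              # forward pass: permute features, then sparse matmul

# Dense-equivalent weight W_eff = W @ P: its sparsity pattern is a column
# permutation of the N:M mask, so nonzero locations are no longer locked
# to the fixed grid -- the source of the added expressivity.
W_eff = W @ P
```

In PA-DST the permutation is learned jointly with the weights (the paper describes it as differentiable and optimizable), whereas here it is a fixed random draw; the sketch only shows that $WP$ reaches masks a rigid N:M grid cannot, while the matmul that must run fast still touches only the structured matrix $W$.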