SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe load imbalance in long-context large language model training with sparse attention. The imbalance arises from the interplay between sequence length heterogeneity and varying sensitivity to sparsity, which together hurt both training efficiency and model accuracy. The authors propose SparseBalance, an algorithm-system co-design framework that co-optimizes these two sources of heterogeneity. SparseBalance combines workload-aware dynamic sparsity tuning, which adjusts sparsity in both directions and puts computation bubbles to use, with sparsity-aware batching for distributed sparse attention training. Evaluated on the LongBench benchmark, the approach improves long-context capability by 0.46% and achieves up to a 1.33× end-to-end training speedup.
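To make the batching idea concrete, here is a minimal sketch of what sparsity-aware batching could look like. It is not the authors' implementation. It assumes a simple cost model in which attention cost scales with the kept fraction of the quadratic score matrix, `(1 - sparsity) * seq_len**2`, and it uses a greedy longest-processing-time assignment as a stand-in balancing heuristic. The function names `attention_cost`, `sparsity_aware_batches`, and `rank_loads` are hypothetical.

```python
import heapq


def attention_cost(seq_len, sparsity):
    """Assumed cost model: kept fraction of the quadratic attention matrix."""
    return (1.0 - sparsity) * seq_len * seq_len


def sparsity_aware_batches(samples, num_ranks):
    """Greedily assign samples to ranks so estimated attention cost is balanced.

    samples: list of (seq_len, sparsity) pairs.
    Returns one list of sample indices per rank.
    """
    # Min-heap of (accumulated_cost, rank): the least-loaded rank is popped first.
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]

    # Place the most expensive samples first (longest-processing-time heuristic).
    order = sorted(
        range(len(samples)),
        key=lambda i: attention_cost(*samples[i]),
        reverse=True,
    )
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + attention_cost(*samples[i]), rank))
    return assignment


def rank_loads(samples, assignment):
    """Total estimated attention cost per rank for a given assignment."""
    return [sum(attention_cost(*samples[i]) for i in idxs) for idxs in assignment]
```

Balancing on sequence length alone would treat a long but highly sparse sequence as expensive. Weighting by the kept fraction is one way to account for both sources of heterogeneity at once.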

📝 Abstract

While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity, leading to severe load imbalance and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue and fail to systematically co-optimize the two. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework that exploits sparsity and sequence heterogeneity to jointly optimize model accuracy and system efficiency. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment: it raises sparsity on overloaded workers to eliminate stragglers and lowers it on underloaded workers to convert inherent bubbles into accuracy gains at no extra time cost. Second, we propose a sparsity-aware batching strategy that achieves coarse-grained balance and complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33× end-to-end speedup while still improving long-context capability by 0.46% on the LongBench benchmark.
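The bidirectional adjustment described above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the paper's algorithm: it assumes each worker's step time is `(1 - sparsity) * dense_cost`, it uses the mean step time at the base sparsity as the balance target, and it ignores per-sample sparsity sensitivity, which the actual method takes into account. The name `tune_sparsity` and its parameters are hypothetical.

```python
def tune_sparsity(dense_costs, base_sparsity, min_sparsity=0.0, max_sparsity=0.95):
    """Pick a per-worker sparsity so that estimated step times meet a shared target.

    dense_costs: estimated step time of each worker with fully dense attention.
    Returns (per-worker sparsity list, target step time).
    """
    # Step times if every worker used the same base sparsity.
    base_times = [(1.0 - base_sparsity) * cost for cost in dense_costs]
    # Assumed target: the mean step time across workers.
    target = sum(base_times) / len(base_times)

    tuned = []
    for cost in dense_costs:
        # Solve (1 - s) * cost = target for s.
        # Stragglers end up sparser; workers with bubbles end up denser.
        s = 1.0 - target / cost
        # Clamp so sparsity stays in the allowed range.
        tuned.append(min(max_sparsity, max(min_sparsity, s)))
    return tuned, target
```

For example, with `dense_costs = [100.0, 200.0, 400.0]` and `base_sparsity = 0.5`, the lightest worker is pushed toward dense attention while the heaviest is made sparser, so the slowest worker no longer dictates the step time.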

Problem

Research questions and friction points this paper is trying to address.

sparse attention

load imbalance

long-context training

sequence heterogeneity

sparsity sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic sparse attention

load balancing

long-context training