AI Summary
During large-scale data-parallel training of long-sequence large language models (LLMs), dynamically varying sequence lengths cause severe load imbalance, driven by three challenges: (1) heterogeneous compute-to-communication ratios across sequence lengths in distributed attention; (2) inflexible static NIC-GPU binding ill-suited to dynamic workloads; and (3) divergent partitioning requirements between quadratic-complexity attention and linear layers. To address these challenges, we propose Zeppelin, a system featuring a novel hierarchical sequence partitioning mechanism, a bandwidth-aware routing layer, and dynamic sequence layout remapping. Zeppelin further introduces an efficient attention engine supporting heterogeneous parallelism. Extensive experiments demonstrate that Zeppelin achieves an average 2.80x speedup across diverse configurations, significantly improving training efficiency and resource utilization for long-sequence LLMs.
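To make the first challenge concrete, the back-of-the-envelope sketch below (not Zeppelin code; all constants such as head count, head dimension, and byte width are illustrative assumptions) estimates attention FLOPs and the key/value bytes a rank would exchange when a sequence is sharded across ranks, e.g. in ring-style distributed attention. Because compute grows quadratically with sequence length while communication grows only linearly, the compute-to-communication ratio varies widely across sequence lengths.

```python
# Rough illustration of why the compute-to-communication ratio in
# distributed attention depends on sequence length L: attention FLOPs
# scale ~O(L^2), while the exchanged K/V volume scales ~O(L).
# Constants below are illustrative assumptions, not values from the paper.

def attention_flops(seq_len: int, num_heads: int = 32, head_dim: int = 128) -> float:
    """Approximate FLOPs of QK^T and PV for one self-attention layer."""
    return 4.0 * num_heads * head_dim * seq_len ** 2

def kv_comm_bytes(seq_len: int, num_heads: int = 32, head_dim: int = 128,
                  bytes_per_elem: int = 2) -> float:
    """Approximate bytes of K and V a rank exchanges when the sequence
    is sharded across ranks (ring-style attention assumption)."""
    return 2.0 * num_heads * head_dim * seq_len * bytes_per_elem

for L in (2_048, 32_768, 262_144):
    ratio = attention_flops(L) / kv_comm_bytes(L)
    print(f"L={L:>7}: ~{ratio:,.0f} FLOPs per communicated byte")
# The ratio grows linearly with L, so short sequences tend to be
# communication-bound while long ones are compute-bound, and a single
# static parallel strategy cannot fit both equally well.
```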
Abstract
Training large language models (LLMs) with increasingly long and varying sequence lengths introduces severe load imbalance challenges in large-scale data-parallel training. Recent frameworks attempt to mitigate these issues through data reorganization or hybrid parallel strategies. However, they often overlook how computational and communication costs scale with sequence length, resulting in suboptimal performance. We identify three critical challenges: (1) varying computation-to-communication ratios across sequences of different lengths in distributed attention, (2) mismatch between static NIC-GPU affinity and dynamic parallel workloads, and (3) distinct optimal partitioning strategies required for quadratic attention versus linear components. To address these challenges, we present Zeppelin, a novel training system that integrates three key techniques: (1) a hierarchical sequence partitioning method for the attention module that reduces communication overhead and balances computation, supported by an efficient attention engine that applies divergent parallel strategies; (2) a routing layer that orchestrates inter-node transfers to fully utilize NIC bandwidth; and (3) a remapping layer that transforms sequence layouts between attention and linear modules, ensuring high computational efficiency across both. Comprehensive evaluations across diverse configurations show that Zeppelin delivers an average 2.80x speedup over state-of-the-art methods.
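The third challenge, that quadratic attention and linear components prefer different partitions, can be illustrated with a minimal greedy bin-packing sketch. This is an assumption-laden illustration, not the paper's partitioning or remapping algorithm: balancing estimated attention cost (proportional to the squared sequence length) across ranks can leave token counts, which drive linear-layer cost, imbalanced, and vice versa, which is why a layout transformation between the two module types is needed.

```python
# Minimal sketch (illustrative only, not Zeppelin's algorithm) showing that
# a partition balanced for quadratic attention cost differs from one
# balanced for token count (linear-layer cost).
import heapq

def greedy_partition(seq_lens, num_ranks, cost_fn):
    """Assign each sequence to the currently least-loaded rank."""
    heap = [(0.0, r) for r in range(num_ranks)]   # (load, rank)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lens, reverse=True):
        load, rank = heapq.heappop(heap)
        buckets[rank].append(length)
        heapq.heappush(heap, (load + cost_fn(length), rank))
    return buckets

seq_lens = [131072, 4096, 4096, 2048, 65536, 8192, 1024, 32768]
by_attention = greedy_partition(seq_lens, 4, cost_fn=lambda L: L * L)  # ~O(L^2) cost
by_tokens    = greedy_partition(seq_lens, 4, cost_fn=lambda L: L)      # ~O(L) cost

for name, parts in (("attention-balanced", by_attention),
                    ("token-balanced", by_tokens)):
    tokens = [sum(p) for p in parts]
    attn   = [sum(l * l for l in p) for p in parts]
    print(name, "| tokens per rank:", tokens, "| attention cost per rank:", attn)
```

Running the sketch shows that whichever cost model is balanced, the other becomes skewed, which is the imbalance Zeppelin's remapping layer is described as resolving by switching sequence layouts between the attention and linear modules.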