🤖 AI Summary
Dynamic training of large language models (LLMs), via techniques such as Mixture of Experts (MoE), early token exit, sparse attention, and parameter pruning, introduces severe load imbalance across devices. To address this, we propose DynMo, the first elastic pipeline parallel scheduling framework tailored for dynamically structured models. Its core innovation is an adaptive load-balancing mechanism that supports dynamic task batching, elastic GPU resource allocation, fine-grained computational graph remapping, and cross-node scaling. Departing from conventional static parallelism, DynMo enables end-to-end autonomous scheduling and resource reuse. Extensive experiments on single- and multi-node multi-GPU systems demonstrate substantial speedups across diverse dynamic training paradigms: 4.52× for early token exit, 4.02× for sparse attention, and 3.18× for parameter pruning.
📝 Abstract
To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes such as Mixture of Experts (MoE), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoD) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that ensures optimal compute distribution when using pipeline parallelism to train dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23× (MoE), 3.18× (pruning), 2.23× (layer freezing), 4.02× (sparse attention), 4.52× (early exit), and 1.17× (MoD). DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.
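To make the two core ideas in the abstract concrete, here is a minimal illustrative sketch, assuming measured per-layer costs are available: (1) rebalancing contiguous layer spans across pipeline stages so each stage carries near-equal load, and (2) packing the remaining work onto fewer workers once dynamic reduction shrinks the total load. This is not DynMo's actual algorithm; the function names, the greedy heuristic, and the `capacity` parameter are hypothetical.

```python
import math

def rebalance(layer_costs, num_stages):
    """Greedily split layers into contiguous stages of near-equal cost.

    layer_costs: measured cost per layer (hypothetical profiling input).
    Returns a list of stages, each a list of layer indices.
    """
    target = sum(layer_costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        remaining_layers = len(layer_costs) - i - 1
        remaining_stages = num_stages - 1 - len(stages)
        # Close the stage once it reaches the per-stage target, while
        # leaving at least one layer for every stage still to be formed.
        if acc >= target and remaining_stages > 0 and remaining_layers >= remaining_stages:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

def pack_workers(layer_costs, num_stages, capacity):
    """Shrink the pipeline when the reduced total load fits fewer workers.

    capacity: hypothetical per-worker compute budget in the same units
    as layer_costs. Freed workers could then be released or reused.
    """
    needed = max(1, math.ceil(sum(layer_costs) / capacity))
    return min(num_stages, needed)
```

For example, if pruning drops half the layers' cost, `pack_workers` would report that fewer pipeline stages suffice, and `rebalance` would redistribute the surviving layers across those stages; a real system would additionally migrate parameters and remap the computation graph, as the summary describes.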