Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing distributed training systems for large language models lack systematic modeling of computation-communication overlap and micro-batch load imbalance, and do not co-design memory optimization techniques with parallelization strategies, leading to suboptimal performance. Method: Mist is a fully automated distributed training tuning system that integrates fine-grained overlap-aware scheduling, symbolic performance modeling, and hierarchical imbalance-aware optimization. It jointly tunes tensor, pipeline, and data parallelism together with the full set of memory optimizations (activation checkpointing, redundancy elimination, and offloading), coupling an inter-stage mixed-integer linear program with an intra-stage dual-objective constrained optimization under memory constraints to maximize throughput. Contribution/Results: Mist achieves average speedups of 1.28× and 1.27× over Megatron-LM and Aceso, respectively, with peak improvements of 1.73× and 2.04×, significantly improving resource utilization and training efficiency.

📝 Abstract
Various parallelism strategies, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training of Large Language Models. To find the best combination of these techniques, automatic distributed training systems have been proposed. However, existing systems tune only a subset of these optimizations, due to a lack of overlap awareness, an inability to navigate the vast search space, and neglect of inter-microbatch imbalance, leading to sub-optimal performance. To address these shortcomings, we propose Mist, a memory-, overlap-, and imbalance-aware automatic distributed training system that comprehensively co-optimizes all memory footprint reduction techniques alongside parallelism. Mist is based on three key ideas: (1) fine-grained overlap-centric scheduling, which orchestrates optimizations in an overlapped manner; (2) symbolic-based performance analysis, which predicts runtime and memory usage using symbolic expressions for fast tuning; and (3) imbalance-aware hierarchical tuning, which decouples the process into an inter-stage, imbalance- and overlap-aware Mixed Integer Linear Programming problem and an intra-stage Dual-Objective Constrained Optimization problem, connecting them through Pareto frontier sampling. Our evaluation shows that Mist achieves an average speedup of 1.28× (up to 1.73×) over the state-of-the-art manual system Megatron-LM and 1.27× (up to 2.04×) over the state-of-the-art automatic system Aceso.
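The symbolic performance analysis in idea (2) can be illustrated with a rough sketch: build runtime and memory estimates once as symbolic expressions over the tunable knobs, then evaluate each candidate configuration by cheap substitution instead of re-profiling it on hardware. The cost formulas and constants below are invented placeholders for illustration, not Mist's actual model.

```python
# Hedged sketch of symbolic performance analysis: the cost/memory models are
# toy formulas, not Mist's. The point is that one symbolic expression serves
# many configurations via substitution.
import sympy as sp

b, t, c = sp.symbols("b t c", positive=True)  # micro-batch size, tensor-parallel degree, ckpt ratio

# Toy per-stage runtime: compute shrinks with tensor parallelism; activation
# checkpointing (ratio c) adds recomputation overhead.
runtime = b * (10 / t) * (1 + 0.3 * c)
# Toy memory: checkpointing shrinks activation memory, TP shards the weights.
memory = b * (4 / t) * (1 - 0.7 * c) + 8 / t

def evaluate(cfg):
    """Substitute one concrete configuration into the symbolic expressions."""
    return (float(runtime.subs(cfg)), float(memory.subs(cfg)))

# Enumerate candidates and keep the fastest one that fits the memory budget.
budget = 6.0
candidates = [{b: mb, t: tp, c: r}
              for mb in (1, 2, 4) for tp in (1, 2, 4) for r in (0.0, 0.5, 1.0)]
feasible = [(evaluate(cfg), cfg) for cfg in candidates
            if evaluate(cfg)[1] <= budget]
best = min(feasible, key=lambda x: x[0][0])  # fastest feasible config
```

Since the expressions are built once, sweeping thousands of configurations costs only substitutions, which is what makes tuning over the full joint search space tractable.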
Problem

Research questions and friction points this paper is trying to address.

Optimize distributed training for Large Language Models
Address lack of overlap awareness in tuning
Solve inter-microbatch imbalance in parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained overlap-centric scheduling for optimization orchestration
Symbolic-based performance analysis for fast tuning
Imbalance-aware hierarchical tuning via MILP and dual-objective optimization
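The hierarchical tuning in the last bullet can be sketched as follows: each stage's intra-stage search yields (runtime, memory) trade-off points, only the Pareto-optimal points are forwarded, and the inter-stage step (an MILP in Mist; brute-forced here for brevity) picks one plan per stage to minimize the bottleneck stage time under a per-device memory budget. All numbers below are made up for illustration.

```python
# Hedged sketch of imbalance-aware hierarchical tuning with Pareto frontier
# sampling. The brute-force inter-stage search stands in for Mist's MILP.
from itertools import product

def pareto_front(points):
    """Keep points not dominated in (runtime, memory); lower is better in both."""
    front = []
    for p in sorted(points):  # sort by runtime, then memory
        if not front or p[1] < front[-1][1]:
            front.append(p)
    return front

# Candidate (runtime, memory) plans per pipeline stage, e.g. from varying
# checkpointing and offloading inside each stage (illustrative values).
stage_plans = [
    [(10, 9), (12, 6), (15, 4), (11, 8)],
    [(8, 7), (9, 5), (14, 3)],
]
fronts = [pareto_front(plans) for plans in stage_plans]

# Inter-stage step: pipeline throughput is bounded by the slowest stage, and
# every stage must fit its device memory budget.
budget = 8
best = min(((max(t for t, _ in combo), combo)
            for combo in product(*fronts)
            if all(m <= budget for _, m in combo)),
           key=lambda x: x[0])
```

Passing only the Pareto frontier keeps the inter-stage problem small without discarding any plan that could be optimal, since a dominated intra-stage plan can never improve either the bottleneck time or the memory feasibility.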