🤖 AI Summary
In pipeline-parallel training of large language models, the vocabulary layers induce computation and memory load imbalance across pipeline stages, worsening pipeline bubbles and GPU memory bottlenecks. This work presents the first systematic analysis of the root causes and proposes *Vocabulary Parallelism*: the vocabulary is partitioned evenly across pipeline stages, low-overhead algorithms reduce the communication (AllGather/ReduceScatter) barriers within the vocabulary layers, and a generalizable scheduling integration jointly balances computation, parameter memory, and activation memory. Combined with schedules such as V-Half, it achieves perfect load balancing, improving throughput by 5% to 51% and significantly reducing peak GPU memory, especially for large vocabularies, while remaining stable across arbitrary vocabulary sizes. The implementation is open-sourced.
📝 Abstract
Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms that reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially for large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.
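To make the idea of sharding the vocabulary layer concrete, here is a minimal single-process sketch of vocabulary-sharded softmax cross-entropy, a standard trick in this vein (it is not necessarily the paper's exact algorithm): each simulated stage computes logits only for its vocabulary shard and exchanges just two per-token scalars (a running max and an exp-sum) instead of gathering the full logits. The function names and the simulation of collectives with plain NumPy reductions are illustrative assumptions.

```python
import numpy as np

def dense_cross_entropy(h, W, targets):
    # Reference: full softmax cross-entropy over the whole vocabulary.
    logits = h @ W
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def vocab_parallel_cross_entropy(h, W_shards, targets):
    # Each "stage" holds one vocabulary shard (a column slice of the
    # output weight) and computes logits only for that shard.
    offsets = np.cumsum([0] + [s.shape[1] for s in W_shards])
    partials = [h @ s for s in W_shards]
    # All-reduce (max) over per-token scalars: numerical stability.
    gmax = np.max([p.max(axis=-1) for p in partials], axis=0)
    # All-reduce (sum) over per-token scalars: softmax denominator.
    denom = sum(np.exp(p - gmax[:, None]).sum(axis=-1) for p in partials)
    # Only the stage owning each target token supplies its logit.
    tgt = np.zeros(len(targets))
    for i, p in enumerate(partials):
        mine = (targets >= offsets[i]) & (targets < offsets[i + 1])
        tgt[mine] = p[mine, targets[mine] - offsets[i]]
    # Per-token loss: log(denom) + gmax - target_logit.
    return (np.log(denom) + gmax - tgt).mean()

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 8))       # hidden states: (tokens, hidden)
W = rng.standard_normal((8, 100))      # output layer: (hidden, vocab)
targets = rng.integers(0, 100, size=16)
sharded = vocab_parallel_cross_entropy(h, np.array_split(W, 4, axis=1), targets)
assert np.isclose(sharded, dense_cross_entropy(h, W, targets))
```

In a real pipeline-parallel setting, the two scalar reductions would be cross-device collectives, which is why the activation memory overhead stays a small constant: no stage ever materializes the full logits tensor.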