A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

📅 2026-05-18

🤖 AI Summary
This work addresses stage misalignment, pipeline bubbles, and low resource utilization in pipeline-parallel training, which arise when computation and communication times fluctuate at runtime. The authors propose a readiness-driven runtime that treats the schedule as a non-binding hint for ranking currently ready tasks rather than as a fixed execution order. It combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. Evaluated at up to 128 GPUs, the approach preserves training correctness and achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads over fixed-order pipeline baselines (with the BFW hint), and up to 1.84× over the faster available external system in cross-framework comparisons (with the default BF hint).
📝 Abstract
Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84× while preserving training correctness.
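The difference between consuming a schedule as a pre-committed order and as a non-binding hint can be illustrated with a toy single-stage simulation. This is only a sketch under stated assumptions, not RRFP's implementation: the function `simulate`, the task names, and the ready times and durations are all hypothetical, and the model ignores communication, tensor-parallel coordination, and inter-stage dependencies. Each task simply becomes ready at a given time, and the stage either waits for the next task in schedule order or runs the best-ranked task that is already ready.

```python
from typing import Dict, List, Tuple


def simulate(
    hint_order: List[str],
    ready_time: Dict[str, float],
    duration: Dict[str, float],
    readiness_first: bool,
) -> Tuple[float, List[str]]:
    """Run all tasks on one stage; return (finish time, executed order).

    Hypothetical toy model: a task may start once the clock has reached
    its ready time. This is not the paper's runtime.
    """
    rank = {task: i for i, task in enumerate(hint_order)}
    pending = set(hint_order)
    clock = 0.0
    executed: List[str] = []

    while pending:
        ready = {t for t in pending if ready_time[t] <= clock}
        # Next task according to the schedule, whether or not it is ready.
        head = min(pending, key=rank.__getitem__)

        if readiness_first:
            # Hint order only ranks the tasks that are ready right now.
            task = min(ready, key=rank.__getitem__) if ready else None
        else:
            # Pre-committed order: only the schedule head may run.
            task = head if head in ready else None

        if task is None:
            # Idle (a bubble) until the next task that could run arrives.
            if readiness_first:
                clock = min(ready_time[t] for t in pending)
            else:
                clock = ready_time[head]
            continue

        clock += duration[task]
        pending.remove(task)
        executed.append(task)

    return clock, executed


# Assumed example: B0 arrives late, but F1 is ready early.
order = ["F0", "B0", "F1", "B1"]
ready_at = {"F0": 0.0, "B0": 5.0, "F1": 1.0, "B1": 6.0}
cost = {"F0": 1.0, "B0": 1.0, "F1": 1.0, "B1": 1.0}

fixed_time, fixed_seq = simulate(order, ready_at, cost, readiness_first=False)
rf_time, rf_seq = simulate(order, ready_at, cost, readiness_first=True)
print(fixed_time, fixed_seq)
print(rf_time, rf_seq)
```

In this assumed example, the fixed-order stage idles while waiting for `B0` even though `F1` is already ready, and finishes at time 8. The readiness-first stage runs `F1` during that gap and finishes at time 7. When every task is ready in schedule order, both policies produce the same execution order, which is the sense in which the hint still guides ranking.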
Problem

Research questions and friction points this paper is trying to address.

pipeline parallelism
runtime variability
task readiness
schedule divergence
stage misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

pipeline parallelism
runtime variability
readiness-driven scheduling
asynchronous communication
tensor-parallel coordination