A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

📅 2026-05-18

🤖 AI Summary

This work addresses stage misalignment, pipeline bubbles, and low resource utilization in pipeline-parallel training, which arise when computation and communication times fluctuate at runtime. The authors propose Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime that treats the schedule as a non-binding hint for ranking currently ready work rather than as a fixed execution order. RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. Evaluated at up to 128 GPUs, RRFP with the BFW hint achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84× while preserving training correctness.

📝 Abstract

Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84× while preserving training correctness.
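The difference between following a pre-committed order and using the schedule as a non-binding hint can be illustrated with a small single-stage simulation. This is only a sketch under stated assumptions, not RRFP's implementation: the function names, the task labels, and the `ready_at` and `dur` tables are hypothetical, and real pipeline dependencies are abstracted into a per-task arrival time.

```python
import heapq
from typing import Dict, List, Tuple


def fixed_order_finish(order: List[str],
                       ready_at: Dict[str, float],
                       dur: Dict[str, float]) -> float:
    """Run tasks strictly in `order`; the stage idles until the next task is ready."""
    clock = 0.0
    for task in order:
        clock = max(clock, ready_at[task]) + dur[task]
    return clock


def readiness_first_finish(hint: List[str],
                           ready_at: Dict[str, float],
                           dur: Dict[str, float]) -> Tuple[float, List[str]]:
    """Run whichever ready task ranks first in `hint`; idle only if nothing is ready."""
    rank = {task: i for i, task in enumerate(hint)}
    pending = sorted(hint, key=lambda t: ready_at[t])  # tasks by arrival time
    ready: List[Tuple[int, str]] = []                  # min-heap keyed by hint rank
    clock, executed, i = 0.0, [], 0
    while len(executed) < len(hint):
        # Admit every task whose inputs have arrived by the current time.
        while i < len(pending) and ready_at[pending[i]] <= clock:
            heapq.heappush(ready, (rank[pending[i]], pending[i]))
            i += 1
        if not ready:
            clock = ready_at[pending[i]]  # nothing runnable: wait for next arrival
            continue
        _, task = heapq.heappop(ready)    # arbitration: best-ranked ready task
        clock += dur[task]
        executed.append(task)
    return clock, executed


# Hypothetical example: F1 arrives late, while B0 and F2 are ready early.
hint = ["F0", "F1", "B0", "F2"]
ready_at = {"F0": 0.0, "F1": 5.0, "B0": 1.0, "F2": 1.0}
dur = {"F0": 1.0, "F1": 1.0, "B0": 1.0, "F2": 1.0}

print(fixed_order_finish(hint, ready_at, dur))      # 8.0
print(readiness_first_finish(hint, ready_at, dur))  # (6.0, ['F0', 'B0', 'F2', 'F1'])
```

In this toy case the fixed-order stage stalls on the late task F1 while B0 and F2 sit ready, whereas the readiness-first stage runs them in hint-rank order and idles only when its ready set is empty. The sketch leaves out the asynchronous communication and tensor-parallel coordination that the abstract describes.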

Problem

Research questions and friction points this paper is trying to address.

pipeline parallelism

runtime variability

task readiness

schedule divergence

stage misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

pipeline parallelism

runtime variability

readiness-driven scheduling