🤖 AI Summary
This work addresses inefficiencies in training compound LLM workloads, such as knowledge distillation and multimodal LLM (MLLM) training, where static heterogeneity across components and dynamic, input-dependent workload irregularity limit GPU utilization under conventional frameworks. The authors propose Maestro, a section-centric training framework that addresses both challenges. Maestro restructures the workload into a coarse-grained section graph in which each section independently configures its parallelism strategy, micro-batch size, and data-parallel degree, which handles static heterogeneity. For runtime irregularity, it adds a wavefront scheduling algorithm that dynamically reorders input samples so that sections execute concurrently while cross-section data dependencies are preserved. Deployed in production for millions of GPU hours, Maestro reduces GPU consumption by approximately 40% on key workloads, including knowledge distillation and MLLM training.
📝 Abstract
Compound LLM training workloads, such as knowledge distillation and multimodal LLM (MLLM) training, are gaining prominence. These workloads typically comprise heterogeneous components that differ in parameter scale, execution mode (forward-only or full forward-backward), and sequence length. Moreover, component activation can be data-dependent: in MLLM training, modality-specific parts activate only when the input contains the corresponding modality, which produces dynamic computational paths and irregular runtime workloads. Conventional frameworks, designed for monolithic models, cannot handle this dual heterogeneity: static (across components) and dynamic (at runtime). By enforcing one-size-fits-all training configurations across components and ignoring input-induced variations, they suffer from suboptimal throughput and poor GPU utilization. In this paper, we introduce Maestro, a section-centric training framework that addresses both challenges. Maestro first restructures the workload into a coarse-grained section graph. Each section independently configures its parallelism strategy, micro-batch size, and data-parallel degree, enabling fine-grained, component-aware resource allocation that tackles static heterogeneity. To tackle runtime irregularity, Maestro introduces a wavefront scheduling algorithm that dynamically reorders input samples to orchestrate concurrent section execution while preserving cross-section data dependencies. This maximizes inter-section parallelism and minimizes stalls, boosting hardware utilization. Deployed in production for millions of GPU hours, Maestro reduces GPU consumption by ~40% on key workloads, including knowledge distillation and MLLM training, validating its real-world impact.
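
To make the two ideas concrete, the following is a minimal Python sketch of a section graph with per-section configuration and a greedy, wavefront-style scheduler that reorders samples. It is an illustration under assumptions, not Maestro's actual algorithm or API: the `Section` class, its field names, the `wavefront_schedule` function, and the one-sample-per-step cost model are all hypothetical. Each sample lists the sections it activates, so a text-only sample skips the vision encoder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """One node of a hypothetical section graph with its own training config."""
    name: str
    parallelism: str        # illustrative label only, e.g. "tp=1"
    micro_batch_size: int   # per-section micro-batch size (unused by the toy scheduler)
    dp_degree: int          # per-section data-parallel degree (unused by the toy scheduler)
    deps: tuple = ()        # names of upstream sections


def wavefront_schedule(sections, samples):
    """Greedy sketch: in every step each section runs at most one ready sample.

    samples maps sample id -> set of section names that the sample activates.
    A (section, sample) pair is ready once every upstream section that is
    active for that sample has already processed it in an earlier step.
    Returns a list of steps, each mapping section name -> sample id.
    """
    pending = {(s.name, sid)
               for s in sections
               for sid, active in samples.items() if s.name in active}
    done = set()
    steps = []
    while pending:
        step = {}
        for s in sections:
            for sid, active in samples.items():
                if (s.name, sid) not in pending:
                    continue
                # Inactive upstream sections impose no dependency on this sample.
                if all(d not in active or (d, sid) in done for d in s.deps):
                    step[s.name] = sid  # pick the first ready sample, possibly out of order
                    break
        if not step:
            raise ValueError("no section can make progress; check deps for cycles")
        # Commit results only after the whole step, so sections in one step run concurrently.
        for pair in step.items():
            pending.discard(pair)
            done.add(pair)
        steps.append(step)
    return steps


if __name__ == "__main__":
    graph = [
        Section("vision_encoder", "tp=1", micro_batch_size=8, dp_degree=4),
        Section("llm", "tp=4", micro_batch_size=2, dp_degree=1, deps=("vision_encoder",)),
    ]
    batch = {
        "a": {"vision_encoder", "llm"},  # image + text
        "b": {"llm"},                    # text only, skips the vision encoder
        "c": {"vision_encoder", "llm"},  # image + text
    }
    for i, step in enumerate(wavefront_schedule(graph, batch)):
        print(i, step)
```

In this toy batch the scheduler runs the text-only sample `b` through the LLM section while the vision encoder is still working on `a`, instead of letting the LLM section wait. This is the kind of stall the abstract says sample reordering is meant to remove. A real system would also account for per-section cost, memory, and the backward pass, none of which is modeled here.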