Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
SYCL programs on multi-GPU clusters suffer from high scheduling latency and substantial critical-path overhead caused by implicit memory allocation, cache-coherence operations, and dependency analysis. Method: The paper proposes the instruction graph, an intermediate representation that fully decouples scheduling from execution. The approach combines speculative scheduling, adaptive virtual-buffer memory allocation, and integration with the Celerity runtime, enabling fully concurrent scheduling of memory management, data transfers, MPI communication, and kernel launches while moving all scheduling analysis off the critical execution path. Contribution/Results: Evaluated on a production cluster with up to 128 GPUs, the method achieves strong scaling, substantially reduces multi-application scheduling latency, and drives critical-path overhead close to zero, overcoming limitations of conventional static and blocking schedulers.
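The instruction graph described above can be pictured as a dependency DAG whose independent nodes remain free to run concurrently. A minimal Python sketch of this idea (the node names and graph structure are illustrative assumptions, not the paper's actual IR node types):

```python
from collections import defaultdict

# Hypothetical instruction graph: nodes are low-level operations
# (allocations, copies, kernel launches, sends); edges are true
# dependencies. This mirrors the kinds of operations the paper
# names, not its concrete IR.
deps = {
    "alloc_A": [],
    "alloc_B": [],
    "copy_host_to_A": ["alloc_A"],
    "kernel": ["copy_host_to_A", "alloc_B"],
    "send_B": ["kernel"],
}

def ready_waves(deps):
    """Kahn-style topological pass grouped into 'waves': every
    instruction in a wave has all of its dependencies satisfied,
    so a whole wave could execute concurrently (e.g. both
    allocations in the first wave)."""
    indeg = {n: len(d) for n, d in deps.items()}
    users = defaultdict(list)
    for node, ds in deps.items():
        for d in ds:
            users[d].append(node)
    wave = [n for n, k in indeg.items() if k == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))
        nxt = []
        for n in wave:
            for u in users[n]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    nxt.append(u)
        wave = nxt
    return waves

print(ready_waves(deps))
# first wave: both allocations, which are independent
```

Because the graph records only true dependencies, the two allocations never serialize against each other; this is the concurrency the representation is designed to preserve.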

📝 Abstract
Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how graph-based intermediate representations help move such scheduling work out of the critical path. In the context of SYCL programs distributed onto accelerator clusters, we introduce the instruction graph, a low-level representation that preserves full concurrency between memory management, data transfers, MPI peer-to-peer communication and kernel invocation. Through integration within the Celerity runtime, we demonstrate how instruction-graph scheduling enables a system architecture that performs this analysis concurrently with execution. Using a scheduler lookahead mechanism, we further detect changing access patterns to optimize memory allocation in the presence of virtualized buffers. We show the effectiveness of our method through strong-scaling benchmarks with multiple Celerity applications on up to 128 GPUs in a production cluster.
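The abstract's core claim, that scheduling analysis can run concurrently with execution instead of on the critical path, can be sketched as two threads connected by a queue: one emits instructions as analysis completes, the other executes them immediately. A minimal Python sketch (the thread structure and instruction names are assumptions for illustration, not Celerity's actual implementation):

```python
import queue
import threading

# Hypothetical instruction stream mirroring the kinds of operations
# the paper names: allocation, copy, kernel launch, peer-to-peer send.
INSTRUCTIONS = ["alloc", "copy", "kernel", "send", "free"]

def schedule(out: queue.Queue) -> None:
    """Scheduler thread: emits each instruction as soon as its
    (simulated) dependency analysis finishes, so analysis cost
    stays off the execution critical path."""
    for instr in INSTRUCTIONS:
        out.put(instr)
    out.put(None)  # sentinel: no more instructions

def execute(inbox: queue.Queue, log: list) -> None:
    """Executor thread: drains ready instructions; in a real runtime
    these would be backend calls (device allocation, MPI send,
    kernel launch)."""
    while True:
        instr = inbox.get()
        if instr is None:
            break
        log.append(instr)

q = queue.Queue()
executed = []
t_sched = threading.Thread(target=schedule, args=(q,))
t_exec = threading.Thread(target=execute, args=(q, executed))
t_sched.start(); t_exec.start()
t_sched.join(); t_exec.join()
print(executed)
```

With a single producer and a FIFO queue, execution order follows emission order, while the executor never waits for the whole schedule to be built before starting work.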
Problem

Research questions and friction points this paper is trying to address.

Optimizing scheduling for high-level parallel programs on multi-GPU systems.
Reducing delays in distributed-memory applications through graph-based representations.
Enhancing memory allocation and concurrency in SYCL programs on accelerator clusters.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based intermediate representations optimize scheduling.
Instruction graph enables concurrent memory and kernel management.
Scheduler lookahead optimizes memory for virtualized buffers.
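The lookahead idea in the last point can be sketched as follows: rather than allocating each accessed sub-range as it arrives, the scheduler peeks at a window of upcoming accesses and allocates one buffer covering their union, avoiding repeated grow-and-copy cycles on a virtualized buffer. A minimal Python sketch (the function name, window size, and half-open range representation are illustrative assumptions):

```python
def plan_allocations(accesses, lookahead=4):
    """Given a sequence of (start, end) sub-ranges a program will
    access, group up to `lookahead` future accesses together and
    emit one allocation spanning each group, instead of one
    allocation (and potential reallocation/copy) per access."""
    plans = []
    i = 0
    while i < len(accesses):
        window = accesses[i:i + lookahead]
        lo = min(a for a, _ in window)
        hi = max(b for _, b in window)
        plans.append((lo, hi))
        i += lookahead
    return plans

# A growing access pattern collapses into a single allocation:
print(plan_allocations([(0, 10), (10, 20), (20, 30), (30, 40)]))
# With no lookahead (window of 1), each access allocates separately:
print(plan_allocations([(0, 10), (20, 30)], lookahead=1))
```

A naive allocator would emit four allocations for the first pattern; with lookahead, one allocation of [0, 40) suffices, which is the kind of access-pattern-driven optimization the summary attributes to the lookahead mechanism.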