🤖 AI Summary
Scheduling workloads on heterogeneous GPU+FPGA systems usually depends on manual partitioning, which makes it hard to optimize energy efficiency and performance together. This paper proposes DyPe, a dynamic scheduling framework that is aware of both data characteristics and inter-operator parallelism. Its core contribution is a data-driven, multi-objective, multi-constraint mapping method that searches for Pareto-optimal configurations. DyPe combines runtime analysis of input data, operator-level parallel scheduling across devices, peer-to-peer (P2P) data transfers, and a model of system constraints and application requirements. On a physical GPU+FPGA platform, across 86 cases covering different workloads and system settings, DyPe finds the optimal schedule in 77 cases, whereas static scheduling is optimal in only 13. On average, it improves throughput by 1.53× and energy efficiency by 1.09× over the static schedule baseline, and throughput by 1.44× and energy efficiency by 1.66× over the GPU-only baseline.
📝 Abstract
Current approaches to scheduling workloads on heterogeneous systems with specialized accelerators often rely on manual partitioning, offloading tasks with specific compute patterns to accelerators. This method requires extensive experimentation and human effort to identify the tasks suitable for the accelerator. To solve this problem, we introduce DyPe, a scheduling framework tailored for heterogeneous systems with specialized accelerators. Our method automatically partitions, deploys, and, when necessary, reschedules execution by dynamically analyzing the characteristics of the input data and leveraging inter-operator parallelism among heterogeneous devices. DyPe navigates a multi-objective, multi-constraint design space that considers both system constraints and application requirements. This allows it to discover Pareto-optimal mapping configurations, improving the system's overall performance and effectively managing energy-performance trade-offs. To demonstrate the benefits of our approach on real hardware, we build a heterogeneous system of GPUs and FPGAs with peer-to-peer data transfers. The experiments show that, across different workloads and system settings, conventional static scheduling is optimal in 13 out of 86 cases, while DyPe adapts and finds the optimal schedule in 77 out of 86 cases, with an average loss of only 3.95% in performance or energy efficiency in the sub-optimal cases. Performance evaluation of DyPe shows an average of 1.53× throughput and 1.09× energy efficiency improvement over the static schedule baseline, and 1.44× throughput and 1.66× energy efficiency improvement over the GPU-only baseline.
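
To make the notion of Pareto-optimal mapping configurations concrete, the following is a minimal sketch and not DyPe's actual search procedure, which the abstract does not describe. It assumes two objectives to maximize (throughput and energy efficiency) and a single illustrative constraint (a power cap). The `Mapping` class, the candidate names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    name: str          # hypothetical label for an operator-to-device assignment
    throughput: float  # higher is better (illustrative units)
    energy_eff: float  # higher is better (illustrative units)
    power_w: float     # used only for the assumed power-cap constraint


def dominates(a: Mapping, b: Mapping) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a.throughput >= b.throughput and a.energy_eff >= b.energy_eff
    strictly_better = a.throughput > b.throughput or a.energy_eff > b.energy_eff
    return no_worse and strictly_better


def pareto_front(candidates: list[Mapping], power_cap_w: float) -> list[Mapping]:
    """Drop infeasible candidates, then keep those no feasible candidate dominates."""
    feasible = [c for c in candidates if c.power_w <= power_cap_w]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Made-up candidates; none of these values come from the paper.
candidates = [
    Mapping("gpu_only", 100.0, 1.0, 250.0),
    Mapping("static_split", 95.0, 1.5, 200.0),
    Mapping("dynamic_split", 150.0, 1.6, 220.0),
    Mapping("fpga_heavy", 80.0, 2.0, 120.0),
    Mapping("all_devices_max", 170.0, 1.2, 400.0),  # violates the power cap
]

front = pareto_front(candidates, power_cap_w=300.0)
print([m.name for m in front])  # ['dynamic_split', 'fpga_heavy']
```

In this toy example, two mappings survive: one favors throughput and the other favors energy efficiency. Neither is better in both objectives, which is the kind of energy-performance trade-off a scheduler would then resolve according to application requirements.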