🤖 AI Summary
To address the coexistence of performance instability and resource underutilization when multiple applications are co-located on a GPU, this paper introduces the first kernel-level, fine-grained framework for quantifying resource interference across multiple hardware layers, including compute units, L1/L2 caches, and memory bandwidth, overcoming the limitations of conventional coarse-grained, utilization-based modeling. The framework combines micro-benchmarks, hardware performance counter sampling, kernel-level isolation experiments, and interference modeling to enable reproducible characterization of interference behavior across these critical subsystems. Building on this characterization, we design a dynamic co-location scheduler with strict service-level objective (SLO) guarantees that improves aggregate GPU utilization by more than 35% while preserving quality-of-service requirements. This work establishes both theoretical foundations and empirical validation for predictable, high-performance GPU resource sharing.
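To make the scheduler's SLO guarantee concrete, the sketch below shows one way an interference-aware admission check could work. It assumes a simple additive per-resource pressure model and a per-kernel slowdown budget; the resource names, profiles, and threshold are illustrative stand-ins, not the paper's actual model.

```python
# Hypothetical SLO-aware co-location admission check. Assumes each kernel
# has a normalized demand profile per shared resource (values in [0, 1])
# and that slowdown tracks the most overloaded shared resource.

RESOURCES = ("compute", "l1_cache", "l2_cache", "mem_bw")

def predicted_slowdown(victim, aggressor):
    """Estimate the victim kernel's slowdown when co-located with the
    aggressor, as the worst combined pressure on any shared resource."""
    pressure = max(victim[r] + aggressor[r] for r in RESOURCES)
    # While total demand fits within the resource, no slowdown is predicted;
    # beyond that, slowdown scales with the overload factor.
    return max(1.0, pressure)

def admit(profile_a, profile_b, slo_slowdown=1.2):
    """Admit the pair only if both kernels stay within their SLO budget."""
    return (predicted_slowdown(profile_a, profile_b) <= slo_slowdown and
            predicted_slowdown(profile_b, profile_a) <= slo_slowdown)

# Complementary profiles (compute-heavy vs. bandwidth-heavy) co-locate well.
a = {"compute": 0.7, "l1_cache": 0.2, "l2_cache": 0.3, "mem_bw": 0.4}
b = {"compute": 0.2, "l1_cache": 0.1, "l2_cache": 0.2, "mem_bw": 0.5}
print(admit(a, b))  # → True
```

A real scheduler would derive the profiles from the hardware-counter characterization described above rather than from static annotations.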
📝 Abstract
GPU hardware is vastly underutilized. Even resource-intensive AI applications have diverse resource profiles that often leave parts of GPUs idle. While colocating applications can improve utilization, current spatial sharing systems lack performance guarantees. Providing predictable performance guarantees requires a deep understanding of how applications contend for shared GPU resources such as block schedulers, compute units, L1/L2 caches, and memory bandwidth. We propose a methodology to profile resource interference of GPU kernels across these dimensions and discuss how to build GPU schedulers that provide strict performance guarantees while colocating applications to minimize cost.
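The profiling methodology above can be sketched as a slowdown matrix: time each kernel in isolation, then pairwise co-located, and normalize. The kernel names and timings below are illustrative placeholders; on real hardware the measurements would come from CUDA events and performance counters.

```python
# Minimal sketch of kernel-level interference profiling: slowdown of each
# "victim" kernel relative to its isolated runtime when co-located with an
# "aggressor". All numbers are stand-ins for measured runtimes (ms).

isolated = {"gemm": 10.0, "conv": 8.0}   # solo-run runtimes
colocated = {                            # (victim, aggressor) -> runtime
    ("gemm", "conv"): 13.0,
    ("conv", "gemm"): 9.2,
}

def slowdown_matrix(isolated, colocated):
    """Per-pair slowdown factor of the victim kernel under co-location."""
    return {pair: t / isolated[pair[0]] for pair, t in colocated.items()}

for (victim, aggressor), s in sorted(slowdown_matrix(isolated, colocated).items()):
    print(f"{victim} co-located with {aggressor}: {s:.2f}x slowdown")
```

Repeating this per resource dimension (e.g., with micro-benchmarks that stress only the L2 cache or only memory bandwidth) attributes each pair's slowdown to a specific shared subsystem.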