Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the low GPU utilization and performance interference caused by coarse-grained resource allocation under mixed HPC, AI, and data analytics workloads. To bridge the granularity gap between static Multi-Instance GPU (MIG) partitions and applications' actual resource demands, the authors propose a co-design mechanism that integrates static MIG partitioning with fine-grained CPU memory offloading, leveraging, for the first time, the cache-coherent NVLink-C2C interconnect. System-level evaluations across representative real-world applications—including NekRS, LAMMPS, Llama3, and Qiskit—demonstrate that the proposed approach significantly reduces GPU idle time, improves throughput and energy efficiency, and mitigates performance degradation caused by resource contention in shared environments.
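The granularity gap the summary describes can be made concrete with a small sketch. The code below is illustrative, not from the paper: it picks the smallest MIG slice that covers a workload's demand and reports the capacity that coarse-grained partitioning strands. The profile table approximates A100-80GB MIG profiles; the demand figures are assumptions.

```python
# Illustrative sketch (not from the paper): pick the smallest MIG slice
# whose compute AND memory both cover an application's demand, and measure
# the stranded capacity that fixed-size partitioning leaves idle.

# Approximate A100-80GB MIG profiles: (name, compute in sevenths, memory GB).
# Hardware-defined in reality; listed here as illustrative assumptions.
MIG_PROFILES = [
    ("1g.10gb", 1, 10),
    ("2g.20gb", 2, 20),
    ("3g.40gb", 3, 40),
    ("4g.40gb", 4, 40),
    ("7g.80gb", 7, 80),
]

def smallest_fitting_profile(compute_sevenths, mem_gb):
    """Return (profile name, (idle compute, idle memory GB)) for the
    smallest profile that satisfies both dimensions of the demand."""
    for name, c, m in MIG_PROFILES:
        if c >= compute_sevenths and m >= mem_gb:
            return name, (c - compute_sevenths, m - mem_gb)
    return None

# A memory-hungry but compute-light workload needing 1/7 of the SMs and
# 35 GB of memory is forced up to a 3g.40gb slice, stranding 2/7 of the
# compute — the mismatch the paper's offloading scheme targets.
name, (idle_compute, idle_mem) = smallest_fitting_profile(1, 35)
print(name, idle_compute, idle_mem)  # → 3g.40gb 2 5
```

Because compute and memory are tightly coupled in each profile, a demand that is skewed along one dimension always strands capacity along the other.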
📝 Abstract
Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet, its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization and enable system-level improvements in throughput and energy, interference still occurs through shared resources, such as power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning for tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent NVLink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.
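The offloading scheme in the abstract can be sketched as a placement decision: keep what fits in the slice's local HBM and spill the remainder to CPU memory that the GPU can access coherently over NVLink-C2C, rather than moving to a larger slice. This is a hypothetical model, not the paper's implementation; the bandwidth figures are illustrative assumptions, not measurements.

```python
# Hypothetical sketch of the offloading idea: when a workload's working set
# exceeds its MIG slice's memory, spill the excess to C2C-attached CPU
# memory instead of claiming a larger slice. Bandwidths are assumptions.

HBM_BW_GBS = 2000   # assumed GPU-local (HBM) bandwidth, GB/s
C2C_BW_GBS = 450    # assumed NVLink-C2C bandwidth to CPU memory, GB/s

def offload_plan(working_set_gb, slice_mem_gb):
    """Split the working set between slice-local HBM and CPU memory,
    and estimate the blended bandwidth of one full sweep over the data."""
    local = min(working_set_gb, slice_mem_gb)
    spilled = working_set_gb - local
    # Time to stream each tier once, at that tier's bandwidth.
    sweep_time = local / HBM_BW_GBS + spilled / C2C_BW_GBS
    effective_bw = working_set_gb / sweep_time
    return spilled, effective_bw

# A 50 GB working set on a 40 GB slice: 10 GB spills to CPU memory, and
# the blended bandwidth stays well above an all-CPU-memory placement.
spilled, bw = offload_plan(working_set_gb=50, slice_mem_gb=40)
print(f"spill {spilled} GB over C2C, effective ~{bw:.0f} GB/s")
```

The design point this models: a modest bandwidth penalty on the spilled fraction can be cheaper, system-wide, than promoting the job to a larger slice and stranding compute there.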
Problem

Research questions and friction points this paper is trying to address.

GPU underutilization
Multi-Instance GPU
resource partitioning
HPC workloads
resource interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Instance GPU
resource underutilization
fine-grained offloading
NVLink-C2C
GPU partitioning