🤖 AI Summary
Modern GPU systems suffer from inflexible static resource management, hindering efficient adaptation to diverse workloads: user-space runtimes lack cross-tenant visibility and hardware control, while kernel-level modifications introduce security vulnerabilities and maintenance overhead. This paper introduces the first eBPF-based policy runtime for GPUs, abstracting GPU drivers and hardware as a programmable OS subsystem. Our key contributions are: (1) a lightweight device-side eBPF virtual machine enabling safe execution of verified policies within GPU kernels; and (2) a secure, driver-level hooking mechanism that jointly ensures programmability, fine-grained hardware control, and multi-tenant observability. Evaluated on inference, training, and vector search workloads, our approach achieves up to 4.8× higher throughput and 2× lower tail latency. Policy deployment requires zero application modification and zero driver restarts, with runtime overhead under 3%.
📝 Abstract
Performance in modern GPU-centric systems depends increasingly on resource management policies, such as memory placement, scheduling, and observability. However, a one-size-fits-all policy performs poorly across diverse workloads. Existing approaches present a tradeoff: user-space runtimes offer programmability but lack cross-tenant visibility and fine-grained hardware control, while OS kernel modifications introduce complexity and safety risks. To address this, we argue that the GPU driver and device layer must serve as an extensible OS policy interface. Emerging eBPF technology offers a promising path, but naively transplanting host-side eBPF is insufficient: it cannot observe critical device-side events, and directly injecting policy code into GPU kernels compromises safety and efficiency.
We present gpu_ext, an eBPF-based policy runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers to expose safe hooks and introduces a device-side eBPF runtime that executes verified policy logic within GPU kernels, enabling coherent, application-transparent policies. Evaluation on realistic workloads, including inference, training, and vector search, shows that gpu_ext improves throughput by up to 4.8× and reduces tail latency by up to 2×, with low overhead and without modifying applications or restarting drivers.