🤖 AI Summary
Modern GPU systems suffer from inflexible static resource management, hindering efficient adaptation to diverse workloads: user-space runtimes lack cross-tenant visibility and hardware control, while kernel-level modifications introduce security vulnerabilities and maintenance overhead. This paper introduces the first eBPF-based policy runtime for GPUs, abstracting GPU drivers and hardware as a programmable OS subsystem. Our key contributions are: (1) a lightweight device-side eBPF virtual machine enabling safe execution of verified policies within GPU kernels; and (2) a secure, driver-level hooking mechanism that jointly ensures programmability, fine-grained hardware control, and multi-tenant observability. Evaluated on inference, training, and vector search workloads, our approach achieves up to 4.8× higher throughput and 2× lower tail latency. Policy deployment requires zero application modification and zero driver restarts, with runtime overhead under 3%.
📝 Abstract
Performance in modern GPU-centric systems depends increasingly on resource management policies, such as memory placement, scheduling, and observability. However, a one-size-fits-all policy performs poorly across diverse workloads. Existing approaches present a tradeoff: user-space runtimes offer programmability but lack cross-tenant visibility and fine-grained hardware control, while OS kernel modifications introduce complexity and safety risks. To address this, we argue that the GPU driver and device layer must serve as an extensible OS policy interface. Emerging eBPF technology offers a promising path, but naively transplanting host-side eBPF is insufficient: it cannot observe critical device-side events, and directly injecting policy code into GPU kernels compromises safety and efficiency.
We present gpu_ext, an eBPF-based policy runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers to expose safe hooks and introduces a device-side eBPF runtime that executes verified policy logic within GPU kernels, enabling coherent, application-transparent policies. Evaluation on realistic workloads, including inference, training, and vector search, shows that gpu_ext improves throughput by up to 4.8× and reduces tail latency by up to 2×, with low overhead and without modifying applications or restarting drivers.