VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the underutilization of asynchronous hardware units in modern GPUs, which stems from software stacks that still organize programming and execution around a monolithic kernel model. The paper introduces Virtual Decoupled Cores (VDCores), a decoupled programming and execution model for asynchronous GPU architectures. VDCores abstracts asynchronous hardware units as resource-isolated virtual cores, expresses compute and memory work as dependency-connected micro-operations, and relies on a runtime to overlap these operations automatically based on dependency and resource readiness. Across four LLM inference workloads on GH200, H100, and RTX 6000 Pro GPUs, it improves decoding throughput by 24% on average and by up to 77% under dynamic inputs, while reducing kernel programming and specialization effort by 90%.

📝 Abstract

Modern GPUs increasingly rely on specialized and asynchronous hardware units to deliver high performance. Yet these units are often underutilized because today's GPU software stacks still organize programming and execution around a monolithic kernel model that mismatches asynchronous hardware. To address this issue, Virtual Decoupled Cores (VDCores) presents a new decoupled programming and execution model for asynchronous GPUs. VDCores abstracts asynchronous hardware execution units as resource-isolated virtual cores and represents workloads as dependency-connected micro-operations (micro-ops). This abstraction removes static orchestration from the programmer, enables automatic overlap of memory and compute based on dependency and resource readiness, and thereby improves utilization of asynchronous hardware resources. Realizing such a decoupled abstraction efficiently on today's GPUs is itself challenging; VDCores addresses this through a GPU-specialized programming model and GPU runtime design that preserve this flexibility while minimizing implementation overhead. Across four LLM inference workloads on GH200, H100, and RTX 6000 Pro GPUs, VDCores improves decoding throughput by 24% on average and by up to 77% under dynamic inputs, while reducing kernel programming and specialization effort by 90%. We have open-sourced VDCores at https://github.com/vdcores/vdcores.
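The idea of issuing micro-ops by dependency and resource readiness can be illustrated with a small host-side toy model. The sketch below is not the VDCores API or runtime: the names MicroOp and schedule, the two unit labels, and the cycle counts are all hypothetical, and a simple greedy list scheduler stands in for whatever policy the actual system uses. It only shows how a memory op and a compute op overlap once their ordering is derived from dependencies rather than fixed by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    # Hypothetical micro-op record; field names are illustrative only.
    name: str
    unit: str          # assumed virtual core label, e.g. "memory" or "compute"
    cycles: int        # assumed fixed latency for this toy model
    deps: tuple = ()   # names of micro-ops that must finish first


def schedule(ops):
    """Greedy list scheduling: one op in flight per unit.

    Returns {name: (start, end)}. An op starts once all of its
    dependencies have finished and its unit is idle.
    """
    finish = {}      # op name -> end time of already scheduled ops
    unit_free = {}   # unit -> time at which the unit becomes idle
    times = {}
    pending = list(ops)

    def start_time(op):
        dep_ready = max((finish[d] for d in op.deps), default=0)
        return max(dep_ready, unit_free.get(op.unit, 0))

    while pending:
        # Ops whose dependencies have all been scheduled.
        ready = [op for op in pending if all(d in finish for d in op.deps)]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        op = min(ready, key=start_time)   # earliest possible start wins
        start = start_time(op)
        end = start + op.cycles
        times[op.name] = (start, end)
        finish[op.name] = end
        unit_free[op.unit] = end
        pending.remove(op)
    return times


# Two independent load -> compute chains (made-up latencies).
demo_ops = [
    MicroOp("load_a", "memory", 4),
    MicroOp("mma_a", "compute", 6, ("load_a",)),
    MicroOp("load_b", "memory", 4),
    MicroOp("mma_b", "compute", 6, ("load_b",)),
]
demo_times = schedule(demo_ops)
makespan = max(end for _, end in demo_times.values())
serial = sum(op.cycles for op in demo_ops)
print(demo_times, makespan, serial)
```

In this toy instance the second load runs while the first compute op is still executing, so the makespan is 16 cycles instead of the 20 cycles a fully serialized order would take. The programmer only declares the dependencies; the overlap falls out of the scheduling rule.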

Problem

Research questions and friction points this paper is trying to address.

asynchronous GPU

resource underutilization

monolithic kernel model

hardware-software mismatch

GPU programming model

Innovation

Methods, ideas, or system contributions that make the work stand out.

VDCores

asynchronous GPU

resource decoupling