🤖 AI Summary
Current GPU programming models (CUDA/HIP) cannot express chiplet-level locality or synchronization, which causes redundant memory accesses and poor cache utilization when memory-bound workloads such as large language model (LLM) inference run on multi-chiplet GPUs. This work proposes Fleet, a multi-level task programming model that explicitly exposes the chiplet hierarchy. Fleet introduces a chiplet-task abstraction that binds computation and data to a specific chiplet, and combines persistent kernels, cooperative weight tiling, and per-chiplet scheduling to enable L2 cache reuse and coordinated execution. Evaluated on an AMD Instinct MI350 running Qwen3-8B, Fleet achieves 1.3–1.5× lower decode latency than vLLM at batch sizes 1–8. At larger batch sizes, it raises the L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), cuts HBM traffic by up to 37%, and delivers a 1.27–1.30× speedup over a chiplet-unaware megakernel baseline.
📝 Abstract
Modern GPUs adopt chiplet-based designs with multiple private cache hierarchies, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads to redundant memory traffic and poor cache utilization in memory-bound workloads such as LLM inference.
We present Fleet, a multi-level task model that maps computation to memory scopes. Fleet introduces Chiplet-tasks, a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse. On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 through persistent kernel execution and per-chiplet scheduling. At larger batch sizes, cooperative weight tiling increases L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering 1.27-1.30x speedup over a chiplet-unaware megakernel baseline.
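The intuition behind binding work and data to a chiplet can be illustrated with a toy model. The sketch below is not Fleet's runtime or API: the function names, the chiplet count of 8, the tile and row-block counts, and the idealized L2 (a tile stays cached once a chiplet has loaded it) are all assumptions made for illustration. It only counts how many times weight tiles would be fetched from HBM under a chiplet-unaware round-robin assignment versus an assignment that sends all tasks on the same weight tile to one chiplet.

```python
from collections import defaultdict

# Assumption: illustrative chiplet count, not a figure taken from the abstract.
NUM_CHIPLETS = 8


def make_tasks(num_weight_tiles, num_row_blocks):
    """Each task multiplies one block of batch rows by one weight tile."""
    return [(tile, rows)
            for tile in range(num_weight_tiles)
            for rows in range(num_row_blocks)]


def flat_schedule(tasks):
    """Chiplet-unaware: tasks are dealt round-robin across chiplets."""
    return {task: i % NUM_CHIPLETS for i, task in enumerate(tasks)}


def chiplet_schedule(tasks):
    """Chiplet-aware: every task on the same weight tile goes to one chiplet."""
    return {task: task[0] % NUM_CHIPLETS for task in tasks}


def hbm_tile_loads(schedule):
    """Count weight-tile fetches from HBM.

    Simplifying assumption: each chiplet's L2 keeps a tile once loaded,
    so a tile costs one HBM fetch per chiplet that touches it.
    """
    tiles_seen = defaultdict(set)
    for (tile, _rows), chiplet in schedule.items():
        tiles_seen[chiplet].add(tile)
    return sum(len(tiles) for tiles in tiles_seen.values())


tasks = make_tasks(num_weight_tiles=16, num_row_blocks=4)
flat_loads = hbm_tile_loads(flat_schedule(tasks))
aware_loads = hbm_tile_loads(chiplet_schedule(tasks))
print(f"flat: {flat_loads} tile loads, chiplet-aware: {aware_loads} tile loads")
```

With these toy parameters, the round-robin assignment spreads each weight tile over four chiplets (64 tile loads in total), while the chiplet-aware assignment loads each tile once (16 tile loads). The ratio is a property of the toy model only and says nothing about the measured results reported above; real L2 caches have finite capacity, and the actual scheduling policy may differ.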