🤖 AI Summary
Transformer-based large language models (LLMs) suffer from low arithmetic intensity in attention layers, severe memory bandwidth bottlenecks due to KV cache access, and poor GPU resource utilization under long-sequence, high-batch inference. Method: This paper proposes a GPU–HPU heterogeneous acceleration architecture, where an FPGA-based High-bandwidth Processing Unit (HPU) connected via PCIe serves as a GPU co-processor. It enables hierarchical KV cache offloading, heterogeneous memory co-scheduling, and decoupling of computation and memory tasks—without requiring additional GPUs. Contribution/Results: Experiments demonstrate up to 4.1× higher throughput and 4.6× better energy efficiency versus GPU-only systems, alongside significant reductions in latency and power consumption. To our knowledge, this work introduces the first scalable hardware-coordinated execution paradigm specifically designed for LLM inference, offering a novel pathway toward efficient large-model deployment.
📝 Abstract
The attention layer, a core component of Transformer-based LLMs, exposes inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batch LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Moreover, as an add-on card, the HPU scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we present an HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU–HPU heterogeneous system demonstrates up to 4.1× performance gains and 4.6× energy-efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
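To see why attention decode is memory-bound, a back-of-the-envelope roofline estimate helps: each decode step reads the entire KV cache but performs only a couple of floating-point operations per element read. The sketch below is illustrative only; the model configuration (a Llama-2-7B-like shape) and fp16 storage are assumptions, not values from the paper.

```python
# Rough arithmetic intensity (FLOPs per byte of DRAM traffic) for one
# attention decode step with a KV cache. Illustrative sketch only.

def attention_decode_intensity(seq_len: int, head_dim: int,
                               num_heads: int, bytes_per_elem: int = 2) -> float:
    # FLOPs: two matrix-vector products per head (q·K^T and probs·V),
    # each ~2 * seq_len * head_dim multiply-adds.
    flops = num_heads * 2 * (2 * seq_len * head_dim)
    # Bytes: KV-cache reads dominate -- K and V, each seq_len x head_dim
    # per head, stored at bytes_per_elem (2 bytes for fp16).
    bytes_moved = num_heads * 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved

# Assumed Llama-2-7B-like shape: 32 heads, head_dim 128, 4K context, fp16.
ai = attention_decode_intensity(seq_len=4096, head_dim=128, num_heads=32)
print(ai)  # 1.0 FLOP/byte
```

At roughly 1 FLOP per byte, attention decode sits far below the ridge point of a modern GPU (hundreds of FLOPs/byte), so throughput is capped by memory bandwidth rather than compute, which is the imbalance the HPU co-processor is designed to absorb.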