Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories

πŸ“… 2025-08-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the HBM bandwidth saturation and compute bottlenecks caused by KV-cache management in long-context LLM inference, this work proposes a packing-prefetch co-scheduling architecture, integrated with monolithic 3D stacking and BEOL-embedded memory to realize an ultra-large-capacity on-chip KV cache, enabling, for the first time, hardware-level spatial and temporal reuse of long-context KV caches. Leveraging dynamic request packing, fine-grained prefetching, and TPU-like system-level co-optimization, the design substantially reduces memory-access pressure. Evaluated on Llama3.1-8B, it achieves an 8.06× decoding speedup and a 1.83× reduction in end-to-end latency. Under multi-request workloads, throughput improves by 1.7–2.4× while HBM bandwidth consumption drops by 1.5–2.4×.
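The dynamic request packing mentioned above can be sketched as a greedy bin-packing pass that groups decode requests into batches whose combined KV-cache footprint fits the on-chip memory budget. The `Request` type, byte sizes, and first-fit-decreasing policy below are illustrative assumptions, not the paper's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    kv_bytes: int  # size of this request's KV cache

def pack_requests(requests, capacity_bytes):
    """Greedy first-fit-decreasing packing: group requests into batches
    whose combined KV footprint fits the on-chip capacity budget."""
    batches = []
    for req in sorted(requests, key=lambda r: r.kv_bytes, reverse=True):
        for batch in batches:
            if batch["used"] + req.kv_bytes <= capacity_bytes:
                batch["reqs"].append(req)
                batch["used"] += req.kv_bytes
                break
        else:
            # No existing batch has room; open a new one.
            batches.append({"reqs": [req], "used": req.kv_bytes})
    return batches
```

A real scheduler would also weigh sequence lengths and arrival times, but the capacity constraint is the part that makes ultra-large on-chip memory pay off: bigger budgets mean fewer, fuller batches.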

πŸ“ Abstract
Long-context Large Language Model (LLM) inference faces growing compute bottlenecks as attention calculations scale with context length, primarily due to KV-cache transfer overhead that saturates High Bandwidth Memory (HBM). While prefetching techniques mitigate cache misses by fetching KV data in advance, their spatial and temporal benefits present new opportunities to exploit. This work proposes a packing-prefetch scheduling architecture with monolithic 3D (M3D) back-end-of-line (BEOL)-compatible embedded memories of ultra-large on-chip capacity to accelerate long-context LLM inference. Our optimizations demonstrate an 8.06x decode speedup and 1.83x overall latency reduction on Llama3.1-8B using TPUv6e-like hardware with an additional 512MB of BEOL memory, relative to serial execution. Evaluations of multi-request workloads on TPU-like architectures show 1.7x-2.4x throughput improvement and 1.5x-2.4x HBM bandwidth reduction compared to packing-only methods on the Llama3.1-8B and Llama3.1-70B models. With the co-design of packing, prefetching, and BEOL memories, our approach alleviates HBM constraints and enables efficient long-context LLM inference.
Problem

Research questions and friction points this paper is trying to address.

Reducing KV-cache transfer overhead in long-context LLM inference
Mitigating High Bandwidth Memory saturation during attention calculations
Accelerating long-context processing despite scaling computational bottlenecks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Packing-prefetch scheduling architecture for LLM acceleration
Monolithic 3D BEOL-compatible embedded memories with ultra-large on-chip capacity
Co-design approach reducing HBM bandwidth constraints
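The prefetching half of the co-design, overlapping the HBM fetch of the next batch's KV cache with the current batch's attention compute, can be sketched as a one-slot producer/consumer pipeline (double buffering). The `fetch_kv` and `compute_attention` callables below are placeholders, not the paper's implementation:

```python
import threading
import queue

def run_pipeline(batches, fetch_kv, compute_attention):
    """Overlap the KV prefetch of batch i+1 with the compute of batch i.

    A Queue with maxsize=1 lets the prefetcher stage exactly one batch
    ahead: while the consumer computes attention on batch i, the producer
    is already fetching batch i+1 from HBM into the on-chip buffer.
    """
    buf = queue.Queue(maxsize=1)

    def prefetcher():
        for b in batches:
            buf.put(fetch_kv(b))  # blocks until the consumer frees the slot
        buf.put(None)  # sentinel: no more batches

    t = threading.Thread(target=prefetcher)
    t.start()
    results = []
    while (kv := buf.get()) is not None:
        results.append(compute_attention(kv))
    t.join()
    return results
```

With balanced fetch and compute times, this hides nearly all of the KV transfer latency behind attention compute, which is the effect the reported decode speedup relies on.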
πŸ”Ž Similar Papers
No similar papers found.