HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing accelerators inadequately address the substantial disparity in computational and memory demands between the prefill and decode phases of low-batch, long-context LLM inference. Method: This paper proposes HALO, a heterogeneous in-memory acceleration architecture based on 2.5D integration that combines HBM-based digital Compute-in-DRAM (CiD) with on-chip analog Compute-in-Memory (CiM), and introduces a phase-aware task mapping strategy that adapts to the distinct workload characteristics of each inference phase. Results: Evaluated on LLaMA-2 7B and Qwen3 8B, the design achieves geometric mean speedups of up to 18× over AttAcc and 2.5× over CENT, significantly improving hardware utilization and energy efficiency. Key contributions include: (i) the first CiD+CiM heterogeneous integration; (ii) a dynamic, phase-aware mapping mechanism tailored to LLM inference stage characteristics; and (iii) systematic optimization for critical interactive workloads, specifically low-batch, long-context scenarios.
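
The headline results are reported as geometric means, which average speedup ratios multiplicatively so that a single large outlier does not dominate. A minimal sketch of that calculation follows; the function name and the input values are illustrative assumptions, not measurements from the paper.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive speedup ratios, computed in log space."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-configuration speedups (NOT taken from the paper).
example_speedups = [2.0, 8.0]
print(f"geometric mean speedup: {geometric_mean(example_speedups):.2f}x")
```

For these two hypothetical values the arithmetic mean would be 5, while the geometric mean is 4, which is why the two should not be compared directly.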

📝 Abstract

The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only on short input contexts, leaving the low-batch, long-context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous memory-centric accelerator designed for the unique challenges of the prefill and decode phases in low-batch LLM inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute-bound operations in the prefill phase are mapped to CiM to exploit its high-throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we present an analysis of the performance tradeoffs of LLMs under two architectural extremes, a fully CiD design and a fully on-chip analog CiM design, to highlight the need for a heterogeneous design. We evaluate HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT, a fully CiD-based mapping.
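
The reason prefill tends to be compute-bound and decode memory-bound can be illustrated with a rough arithmetic-intensity estimate for a single weight matrix multiply. The sketch below is a simplified roofline-style heuristic, not the mapping algorithm used by HALO: the `MatMul` class, the `map_op` function, the 16-bit operand size, the 2048-token prompt, and the `RIDGE` threshold are all assumptions made for illustration. Only the hidden size of 4096 corresponds to LLaMA-2 7B.

```python
from dataclasses import dataclass


@dataclass
class MatMul:
    """An (m x k) @ (k x n) multiply; m is the number of tokens processed at once."""
    m: int
    k: int
    n: int
    bytes_per_elem: int = 2  # assumption: 16-bit operands

    def flops(self) -> int:
        # One multiply and one add per inner-product term.
        return 2 * self.m * self.k * self.n

    def bytes_moved(self) -> int:
        # Read both operands once and write the result once (no reuse modeled).
        return self.bytes_per_elem * (
            self.m * self.k + self.k * self.n + self.m * self.n
        )

    def intensity(self) -> float:
        return self.flops() / self.bytes_moved()


def map_op(op: MatMul, ridge_flops_per_byte: float) -> str:
    """Route high-intensity ops to CiM and low-intensity ops to CiD."""
    return "CiM" if op.intensity() >= ridge_flops_per_byte else "CiD"


HIDDEN = 4096  # hidden size of LLaMA-2 7B
RIDGE = 50.0   # hypothetical compute/memory crossover, not a HALO parameter

prefill = MatMul(m=2048, k=HIDDEN, n=HIDDEN)  # assumed 2048-token prompt, batch 1
decode = MatMul(m=1, k=HIDDEN, n=HIDDEN)      # one new token per step, batch 1

for name, op in [("prefill", prefill), ("decode", decode)]:
    print(f"{name}: {op.intensity():.1f} FLOPs/byte -> {map_op(op, RIDGE)}")
```

Under these assumptions the prefill multiply performs roughly 1024 FLOPs per byte moved, while the decode multiply performs about 1, because at batch size 1 every weight is fetched to produce a single output token. That gap of about three orders of magnitude is the motivation for sending the two phases to different kinds of hardware.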

Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM inference for low-batch interactive applications

Addressing divergent compute and memory needs in prefill and decode phases

Overcoming limitations of high-batch or short-context prior accelerators

Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous accelerator with 2.5D integration

Phase-aware mapping strategy for the compute-bound prefill and memory-bound decode phases

Combines Compute-in-Memory with Compute-in-DRAM

🔎 Similar Papers

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective