🤖 AI Summary
To address load imbalance between the prefill and decoding phases, low hardware utilization, and degraded Quality-of-Service (QoS) in large language model (LLM) inference services, this paper proposes an automated heterogeneous dataflow architecture exploration framework tailored for LLM serving. The framework introduces a template-library-based hardware–software co-design space search that combines customizable heterogeneous dataflow architecture templates, joint hardware–software modeling, and QoS-aware evaluation to optimize throughput and latency together. Evaluated against an NVIDIA A100 GPU, the framework achieves a 2.51× improvement in QoS and a 4.01× gain in area efficiency at high batch sizes. These results make LLM inference services significantly more scalable and cost-effective while maintaining rigorous QoS guarantees.
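A minimal sketch of what such a template-based, QoS-aware search loop might look like. The `Template` fields, the analytical latency model, the SLO targets, and all candidate numbers below are illustrative assumptions, not the paper's actual templates or cost models:

```python
# Toy template-based, QoS-aware design-space search (illustrative only;
# ADOR's real templates and models are far more detailed).
from dataclasses import dataclass

@dataclass
class Template:
    name: str
    tflops: float    # peak fp16 compute (TFLOP/s), assumed
    tbps: float      # memory bandwidth (TB/s), assumed
    area_mm2: float  # die area (mm^2), assumed

MODEL_GB = 26.0      # fp16 weights of a ~13B-parameter model (assumed)
PROMPT_LEN = 2048    # prefill length (assumed)
BATCH = 64
TTFT_SLO_MS = 500.0  # time-to-first-token target (assumed QoS metric)
TPOT_SLO_MS = 50.0   # time-per-output-token target (assumed QoS metric)

def prefill_ms(t: Template) -> float:
    # ~2 FLOPs per parameter per token; fp16 => params = bytes / 2.
    flops = 2 * (MODEL_GB * 1e9 / 2) * PROMPT_LEN
    return flops / (t.tflops * 1e12) * 1e3

def decode_step_ms(t: Template) -> float:
    # Memory-bound decode streams all weights once per step
    # (KV-cache traffic ignored for brevity); GB / (TB/s) == ms.
    return MODEL_GB / t.tbps

candidates = [
    Template("compute-heavy",   tflops=400.0, tbps=1.0, area_mm2=800.0),
    Template("bandwidth-heavy", tflops=100.0, tbps=3.0, area_mm2=800.0),
    Template("balanced",        tflops=250.0, tbps=2.0, area_mm2=800.0),
]

best = None
for t in candidates:
    ttft, tpot = prefill_ms(t), decode_step_ms(t)
    if ttft > TTFT_SLO_MS or tpot > TPOT_SLO_MS:
        continue  # QoS-aware pruning: drop designs that violate an SLO
    tokens_per_s = BATCH / (tpot / 1e3)
    area_eff = tokens_per_s / t.area_mm2  # throughput per mm^2
    if best is None or area_eff > best[1]:
        best = (t.name, area_eff)

print(f"selected: {best[0]} ({best[1]:.1f} tokens/s/mm^2)")
```

In this toy search, the bandwidth-heavy candidate has the best decode throughput but is pruned for missing the time-to-first-token target, so the balanced design wins on area efficiency, a small-scale version of the throughput/latency tension the framework is built to navigate.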
📝 Abstract
The growing adoption of Large Language Models (LLMs) across various domains has driven the demand for efficient and scalable AI-serving solutions. Deploying LLMs requires optimizations to manage their significant computational and data demands. The prefill stage processes large numbers of input tokens in parallel, increasing computational load, while the decoding stage relies heavily on memory bandwidth due to the auto-regressive nature of LLMs. Current hardware, such as GPUs, often fails to balance these demands, leading to inefficient utilization. While batching improves hardware efficiency, it delays response times, degrading Quality-of-Service (QoS). This disconnect between vendors, who aim to maximize resource efficiency, and users, who prioritize low latency, highlights the need for a better solution. To address this, we propose ADOR, a framework that automatically identifies and recommends hardware architectures tailored to LLM serving. By leveraging predefined architecture templates specialized for heterogeneous dataflows, ADOR optimally balances throughput and latency. It efficiently explores design spaces to suggest architectures that meet the requirements of both vendors and users. ADOR demonstrates substantial performance improvements, achieving 2.51× higher QoS and 4.01× better area efficiency compared to the A100 at high batch sizes, making it a robust solution for scalable and cost-effective LLM serving.
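As a rough, back-of-the-envelope illustration of the prefill/decode imbalance the abstract describes, the roofline-style comparison below assumes fp16 weights and A100-class peak figures of roughly 312 TFLOP/s and 2 TB/s; these numbers are common published specs, not values from the paper:

```python
# Roofline-style comparison of prefill vs. decode (illustrative).
BYTES_PER_PARAM = 2  # fp16 weights (assumed)

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weight traffic for a dense GEMM.

    Each weight is read once per forward pass and contributes
    2 FLOPs (multiply + accumulate) per token in the pass.
    """
    return 2 * tokens_per_pass / BYTES_PER_PARAM

# A100-class machine balance point: ~312 TFLOP/s fp16 over ~2 TB/s HBM
# => ~156 FLOPs/byte needed to stay compute-bound.
MACHINE_BALANCE = 312e12 / 2e12

for phase, tokens in [("prefill (2048-token prompt)", 2048),
                      ("decode (1 token/step)", 1)]:
    ai = arithmetic_intensity(tokens)
    regime = "compute-bound" if ai >= MACHINE_BALANCE else "memory-bound"
    print(f"{phase}: {ai:.0f} FLOPs/byte -> {regime}")
```

Prefill's arithmetic intensity scales with prompt length and lands far above the machine balance point, while decode sits near 1 FLOP/byte, which is why a single fixed compute-to-bandwidth ratio struggles to serve both phases efficiently.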