Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

📅 2024-05-28
🏛️ arXiv.org
📈 Citations: 12
Influential: 1
🤖 AI Summary
To address the high hardware overhead, substantial memory footprint, and elevated training costs of autoregressive decoding in large language models (LLMs), this paper proposes hardware-aware Parallel Prompt Decoding (PPD). PPD introduces three key innovations: (1) a lightweight, multi-step parallel output approximation mechanism inspired by human language generation, which raises the acceptance rate of long-range predictions by up to 28%; (2) a dynamic sparse tree structure coupled with hardware-aware scheduling that jointly optimizes computational efficiency and memory bandwidth utilization; and (3) orthogonal compatibility with existing speculative decoding frameworks, yielding up to 1.22× further speedup. PPD requires only 0.0002% trainable parameters and achieves up to 2.49× end-to-end speedup across models from MobileLlama to Vicuna-13B, with merely 0.0004% runtime memory overhead. Training completes in 16 hours on a single A100-40GB GPU.

📝 Abstract
The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on improving processing speed such as throughput. Crucially, they often neglect other metrics essential for real-life deployments, such as memory consumption and training cost. To overcome these limitations, we propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Inspired by the human natural language generation process, PPD approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. This approach partially recovers the missing conditional dependency information necessary for multi-token generation, resulting in up to a 28% higher acceptance rate for long-range predictions. Furthermore, we present a hardware-aware dynamic sparse tree technique that adaptively optimizes this decoding scheme to fully leverage the computational capacities on different GPUs. Through extensive experiments across LLMs ranging from MobileLlama to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to 2.49× speedup and maintains a minimal runtime memory overhead of just 0.0004%. More importantly, our parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to 1.22× further speed improvement. Our code is available at https://github.com/hmarkc/parallel-prompt-decoding.
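The draft-and-verify flow the abstract describes can be illustrated with a toy sketch. This is not the paper's implementation: `target_next` and `draft_parallel` below are hypothetical stand-ins for the base LLM and for PPD's prompt-token approximation; the verification loop simply accepts the longest drafted prefix that matches the target, which is the mechanism that makes the acceptance rate matter.

```python
def target_next(context):
    # Toy stand-in for the base LLM: deterministically emits the next token.
    return (sum(context) * 31 + len(context)) % 100

def draft_parallel(context, k):
    # Stand-in for PPD's prompt tokens: guesses k future tokens in one
    # parallel step. Here the first two guesses happen to match the target,
    # mimicking a partially correct multi-token draft.
    guesses, ctx = [], list(context)
    for i in range(k):
        tok = target_next(ctx) if i < 2 else 0  # later guesses may be wrong
        guesses.append(tok)
        ctx.append(tok)
    return guesses

def decode_step(context, k=4):
    # Verify the k drafted tokens against the target model, accept the
    # longest correct prefix, then append one corrected token, so each
    # step gains at least one token and often several.
    drafts = draft_parallel(context, k)
    ctx, accepted = list(context), 0
    for tok in drafts:
        if tok != target_next(ctx):
            break
        ctx.append(tok)
        accepted += 1
    ctx.append(target_next(ctx))
    return ctx, accepted

ctx, accepted = decode_step([1, 2, 3])
```

With this toy target, one `decode_step` accepts two drafted tokens and adds one corrected token, so the sequence grows by three tokens in a single verification pass.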
Problem

Research questions and friction points this paper is trying to address.

Reducing memory consumption in autoregressive LLM inference
Minimizing training costs for speculative decoding techniques
Improving hardware efficiency through parallel prompt decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses parallel prompt tokens for future output approximation
Employs hardware-aware dynamic sparse tree optimization
Requires minimal trainable parameters and memory overhead
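The dynamic sparse tree idea behind the second bullet can be sketched as a small greedy builder (a hypothetical illustration, not the paper's algorithm): given estimated acceptance probabilities for candidate tokens at each draft depth, keep only the nodes whose whole path is most likely to be accepted, under a node budget that a hardware-aware scheduler would set per GPU.

```python
import heapq

def build_sparse_tree(probs, budget):
    # probs[d][t]: estimated acceptance probability of candidate token t at
    # draft depth d (hypothetical numbers). Greedily keep the `budget`
    # highest-joint-probability nodes, so deeper candidates survive only
    # when their entire path is likely to be accepted.
    heap = [(-p, (t,)) for t, p in enumerate(probs[0])]
    heapq.heapify(heap)
    kept = []
    while heap and len(kept) < budget:
        neg_p, path = heapq.heappop(heap)
        kept.append(path)
        if len(path) < len(probs):
            for t, p in enumerate(probs[len(path)]):
                # neg_p is negative, so neg_p * p orders by joint probability.
                heapq.heappush(heap, (neg_p * p, path + (t,)))
    return kept

# Hypothetical acceptance estimates: 3 candidates per depth, 2 depths.
probs = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]
tree = build_sparse_tree(probs, budget=4)
```

With a budget of 4 nodes, the builder keeps both top first-depth candidates plus the most promising child under each, rather than expanding every branch uniformly.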
Hao Chen
Imperial College London, UK
Wayne Luk
Professor of Computer Engineering, Imperial College London
Hardware and Architecture · Reconfigurable Computing · Design Automation
Ka-Fai Cedric Yiu
Hong Kong Polytechnic University, Hong Kong
Rui Li
Samsung AI Center, Cambridge, UK
Konstantin Mishchenko
Meta
Deep Learning · Optimization
Stylianos I. Venieris
Senior Research Scientist @ Samsung AI, Cambridge, UK
Deep Learning · FPGAs · Mobile Computing · Design Automation
Hongxiang Fan
Imperial College London, UK; Samsung AI Center, Cambridge, UK