🤖 AI Summary
To mitigate the hardware overhead of autoregressive decoding in large language models (LLMs) without the heavy memory footprint and training cost of existing multi-token generation methods, this paper proposes Parallel Prompt Decoding (PPD). PPD introduces three key ideas: (1) a lightweight mechanism, inspired by human natural language generation, that uses trained prompt tokens to approximate outputs at future timesteps in parallel, yielding up to a 28% higher acceptance rate for long-range predictions; (2) a hardware-aware dynamic sparse tree technique that adapts the decoding scheme to the computational capacity of different GPUs; and (3) orthogonal compatibility with existing speculative decoding, providing up to 1.22× further speedup. PPD requires only 0.0002% trainable parameters, trains in just 16 hours on a single A100-40GB GPU, and achieves up to 2.49× end-to-end speedup across models from MobileLlama to Vicuna-13B, with a runtime memory overhead of merely 0.0004%.
📝 Abstract
The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on speed-related metrics such as throughput. Crucially, they often neglect other metrics essential for real-life deployments, such as memory consumption and training cost. To overcome these limitations, we propose a novel parallel prompt decoding ($PPD$) that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Inspired by the human natural language generation process, $PPD$ approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. This approach partially recovers the missing conditional dependency information necessary for multi-token generation, resulting in up to a 28% higher acceptance rate for long-range predictions. Furthermore, we present a hardware-aware dynamic sparse tree technique that adaptively optimizes this decoding scheme to fully leverage the computational capacities on different GPUs. Through extensive experiments across LLMs ranging from MobileLlama to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to 2.49$\times$ speedup and maintains a minimal runtime memory overhead of just $0.0004$%. More importantly, our parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to $1.22\times$ further speed improvement. Our code is available at https://github.com/hmarkc/parallel-prompt-decoding.
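The draft-and-verify pattern behind this style of multi-token decoding can be illustrated with a toy sketch. This is not the paper's implementation: `toy_model` is a hypothetical stand-in for the base LLM's next-token function, and `draft_k_tokens` here simply rolls it forward (with a deliberately corrupted final guess), whereas PPD would obtain all $k$ draft tokens from trained prompt tokens in a single forward pass. The verification step, however, shows the standard invariant: only the longest draft prefix the base model agrees with is accepted, so the output matches plain autoregressive decoding.

```python
def toy_model(tokens):
    """Deterministic stand-in for an LLM's greedy next-token function."""
    return (sum(tokens) * 31 + len(tokens)) % 100

def draft_k_tokens(tokens, k):
    """Cheaply guess the next k tokens. Here we roll the toy model
    forward and corrupt the last guess to mimic an imperfect drafter;
    PPD instead appends k trained prompt tokens so all k guesses come
    out of one forward pass."""
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = toy_model(ctx)
        draft.append(t)
        ctx.append(t)
    draft[-1] = (draft[-1] + 1) % 100  # simulate a long-range miss
    return draft

def verify_and_accept(tokens, draft):
    """Accept the longest draft prefix the base model agrees with, then
    append one guaranteed-correct token (standard speculative
    verification). Returns (new_sequence, num_accepted)."""
    ctx, accepted = list(tokens), 0
    for guess in draft:
        correct = toy_model(ctx)
        if guess != correct:
            ctx.append(correct)  # fall back to the base model's token
            return ctx, accepted
        ctx.append(guess)
        accepted += 1
    ctx.append(toy_model(ctx))  # bonus token after a fully accepted draft
    return ctx, accepted

seq, accepted = verify_and_accept([1, 2, 3], draft_k_tokens([1, 2, 3], 3))
```

With this toy drafter, two of the three guesses are accepted per step, so each decoding iteration emits three tokens instead of one while producing exactly the sequence greedy autoregressive decoding would.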