ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

πŸ“… 2025-08-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the high inference latency that autoregressive decoding imposes on large language models (LLMs), this work identifies and exploits an inherent parallel-segment structure in LLM-generated text, which the authors present as the first such discovery. They propose a non-intrusive, training-free adaptive parallel decoding framework that automatically extracts and validates parallel segments, then drives a hybrid decoding engine that switches dynamically between sequential and parallel modes while jointly optimizing KV cache reuse and management. Across multiple benchmarks, the method achieves up to 3.19Γ— inference speedup (1.85Γ— on average) with less than 1% degradation in generation quality. The core contribution is revealing the intrinsic parallelism in LLM outputs and designing a lightweight, general-purpose, plug-and-play parallel decoding paradigm grounded in that insight.

πŸ“ Abstract
The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm and the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To enable efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine that transitions seamlessly between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves strong performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while keeping response quality within 1% of autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a benchmark for efficient LLM parallel inference, paving the way for deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
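The control flow the abstract describes can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's implementation: a decoding plan alternates between serial steps and parallel segments whose branches are decoded concurrently (in ASPD, each branch would reuse the shared prefix's KV cache; here `decode_branch` is a stand-in for that per-branch model call).

```python
from concurrent.futures import ThreadPoolExecutor

def decode_branch(branch_prompt):
    # Stand-in for decoding one parallelizable branch; a real engine would
    # run the model on this branch while reusing the shared-prefix KV cache.
    return branch_prompt + " [done]"

def adaptive_decode(plan):
    """plan: a list of steps. A str is decoded serially (autoregressive
    mode); a list of strs is a parallel segment whose branches are
    decoded concurrently, then merged back into the serial stream."""
    output = []
    for step in plan:
        if isinstance(step, list):           # switch to parallel mode
            with ThreadPoolExecutor() as pool:
                # pool.map preserves branch order, so the merged output
                # matches the serial reading order of the segment.
                output.extend(pool.map(decode_branch, step))
        else:                                # ordinary serial decoding
            output.append(step)
    return output

result = adaptive_decode(["intro", ["point A", "point B"], "outro"])
```

In the actual system, detecting where a parallel segment begins and ends is the job of the extraction pipeline; this sketch assumes that segmentation is already given as `plan`.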
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in large language models
Identifying parallelizable structures in autoregressive outputs
Enabling seamless serial-parallel decoding transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Serial-Parallel Decoding (ASPD) for LLMs
Non-invasive pipeline extracts parallelizable structures
Hybrid Decoding Engine enables seamless mode transitions