🤖 AI Summary
Autoregressive models suffer from high generation latency because each token depends on all previously generated tokens, limiting inference efficiency. This paper proposes a context-aware pipelined parallel decoding architecture: the output sequence is partitioned into multiple subsequences, each generating one new token synchronously per step, with a lightweight context synchronization mechanism ensuring coherence. The method introduces no additional parameters or KV cache overhead, preserving autoregressive modeling capability and generation quality while enabling multi-token parallelism. Experiments on question answering, summarization, and keyphrase generation show a 1.8–2.3× speedup in inference latency, with negligible degradation (<0.5 points) in BLEU and ROUGE scores and a virtually unchanged memory footprint. The core contribution is the first integration of a structured pipelined mechanism into autoregressive decoding, achieving high-quality parallel generation without any memory overhead.
📝 Abstract
As the basis of generative AI, an autoregressive model generates each new token conditioned on all previously generated tokens. This dependency yields high quality but forces the model to produce tokens one by one, creating a bottleneck that limits generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously and, at each time step, generates a new token for each subsequence, achieving parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves generation speed without a significant loss of generation quality or additional memory consumption.
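To make the decoding scheme concrete, here is a minimal sketch of the per-step loop described above: several subsequences are advanced in lockstep, one token each per step, with every subsequence conditioning on a shared snapshot of all subsequences as its context. The `predict_next` function and the `toy_predict` example are hypothetical placeholders, not the paper's model; this is a simplified illustration of the control flow, not the authors' implementation.

```python
from typing import Callable, List

def pipelined_decode(
    predict_next: Callable[[List[str], List[List[str]]], str],
    num_subsequences: int,
    max_steps: int,
    eos: str = "<eos>",
) -> List[List[str]]:
    """Generate `num_subsequences` subsequences in parallel.

    At each step, every unfinished subsequence is extended by one token,
    conditioned on its own prefix plus a snapshot of all subsequences
    (a stand-in for the paper's context synchronization mechanism).
    """
    subsequences: List[List[str]] = [[] for _ in range(num_subsequences)]
    for _ in range(max_steps):
        # Shared context for this step: the state of every subsequence so far.
        context = [list(s) for s in subsequences]
        for seq in subsequences:
            if seq and seq[-1] == eos:
                continue  # this subsequence has already finished
            seq.append(predict_next(seq, context))
        if all(s and s[-1] == eos for s in subsequences):
            break
    return subsequences

# Toy usage: a dummy predictor that emits two tokens and then <eos>.
def toy_predict(prefix: List[str], context: List[List[str]]) -> str:
    return "<eos>" if len(prefix) >= 2 else f"tok{len(prefix)}"

result = pipelined_decode(toy_predict, num_subsequences=3, max_steps=10)
```

Because every subsequence emits one token per step, a batch-capable model can score all of them in a single forward pass, which is where the reported latency reduction would come from.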