Pipelined Decoder for Efficient Context-Aware Text Generation

📅 2025-06-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive models suffer from high generation latency due to sequential, token-by-token dependency, limiting inference efficiency. This paper proposes a context-aware pipelined parallel decoding architecture: the output sequence is partitioned into multiple subsequences, each generating one new token synchronously per step, with a lightweight context synchronization mechanism ensuring coherence. The method introduces no additional parameters or KV cache overhead, preserving autoregressive modeling capability and generation quality while enabling multi-token parallelism. Experiments on question answering, summarization, and keyword generation show 1.8–2.3× speedup in inference latency, with negligible degradation (<0.5 points) in BLEU and ROUGE scores and virtually unchanged memory footprint. The core contribution is the first integration of a structured pipelined mechanism into autoregressive decoding—achieving high-quality parallel generation without any memory overhead.

📝 Abstract
As the basis of generative AI, an autoregressive model generates each new token conditioned on all previously generated tokens, which yields high quality but also restricts the model to generating tokens one by one, forming a bottleneck that limits generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously and, at each time step, generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves generation speed without a significant loss of generation quality or additional memory consumption.
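The decoding scheme the abstract describes can be illustrated with a minimal sketch. This is not the paper's code: `toy_next_token` is a hypothetical stand-in for the language model, and the function and parameter names (`pipelined_decode`, `num_subseqs`, `max_len`) are invented for illustration. The point is the control flow: the output is partitioned into subsequences, and at every step each subsequence emits one token conditioned on a shared context snapshot, so N tokens are produced per step instead of one.

```python
def toy_next_token(context, subseq):
    # Hypothetical stand-in for the model's next-token prediction:
    # here the token is just derived from the lengths of the shared
    # context and the local subsequence.
    return f"t{len(context)}_{len(subseq)}"

def pipelined_decode(prompt_tokens, num_subseqs=3, max_len=4):
    """Sketch of pipelined parallel decoding: the output is split into
    `num_subseqs` subsequences that each grow by one token per step."""
    subseqs = [[] for _ in range(num_subseqs)]
    for step in range(max_len):
        # Context snapshot shared by all subsequences this step:
        # the prompt plus everything generated so far (this plays the
        # role of the paper's lightweight context synchronization).
        context = prompt_tokens + [tok for s in subseqs for tok in s]
        # All subsequences advance one token "in parallel".
        new_tokens = [toy_next_token(context, s) for s in subseqs]
        for s, tok in zip(subseqs, new_tokens):
            s.append(tok)
    # Final output is the concatenation of the subsequences.
    return [tok for s in subseqs for tok in s]

out = pipelined_decode(["<q>"], num_subseqs=2, max_len=3)
print(out)  # 6 tokens produced in only 3 steps
```

With 2 subsequences and 3 steps, 6 tokens are generated in 3 model invocations per subsequence, which is where the reported 1.8–2.3× latency speedup would come from in a real implementation with batched parallel forward passes.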
Problem

Research questions and friction points this paper is trying to address.

Autoregressive models limit text generation speed because tokens must be produced sequentially
How to decode multiple tokens in parallel while remaining context-aware
How to improve speed without compromising quality or increasing memory usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipelined decoder enables parallel text generation
Generates multiple subsequences simultaneously for efficiency
Maintains quality without extra memory usage
Zixian Huang
Shanghai AI Lab
Question Answering · Natural Language Processing
Chenxu Niu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Yu Gu
The Ohio State University, Columbus, USA
Gengyang Xiao
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Xinwei Huang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Gong Cheng
Professor, Nanjing University
big data search · knowledge graph · LLM inference