🤖 AI Summary
Autoregressive models suffer from high generation latency because each token depends on all previously generated tokens, limiting inference efficiency. This paper proposes a context-aware pipelined parallel decoding architecture: the output sequence is partitioned into multiple subsequences, each generating one new token synchronously per step, with a lightweight context synchronization mechanism ensuring coherence. The method introduces no additional parameters or KV cache overhead, preserving autoregressive modeling capability and generation quality while enabling multi-token parallelism. Experiments on question answering, summarization, and keyphrase generation show a 1.8–2.3× speedup in inference latency, with negligible degradation (<0.5 points) in BLEU and ROUGE scores and a virtually unchanged memory footprint. The core contribution is the first integration of a structured pipelined mechanism into autoregressive decoding, achieving high-quality parallel generation without any memory overhead.
📝 Abstract
As the basis of generative AI, an autoregressive model generates each new token conditioned on all previously generated tokens. This dependency yields high quality but forces the model to produce tokens one by one, creating a bottleneck that limits generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously and, at each time step, generates a new token for each subsequence, achieving parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves generation speed without a significant loss of generation quality or additional memory consumption.
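To make the decoding scheme concrete, here is a minimal sketch of the per-step loop described above: several subsequences are advanced in lockstep, one token each per step, with every subsequence conditioning on a shared snapshot of all subsequences as its context. The `predict_next` function and the `toy_predict` example are hypothetical placeholders, not the paper's model; this is a simplified illustration of the control flow, not the authors' implementation.

```python
from typing import Callable, List

def pipelined_decode(
    predict_next: Callable[[List[str], List[List[str]]], str],
    num_subsequences: int,
    max_steps: int,
    eos: str = "<eos>",
) -> List[List[str]]:
    """Generate `num_subsequences` subsequences in parallel.

    At each step, every unfinished subsequence is extended by one token,
    conditioned on its own prefix plus a snapshot of all subsequences
    (a stand-in for the paper's context synchronization mechanism).
    """
    subsequences: List[List[str]] = [[] for _ in range(num_subsequences)]
    for _ in range(max_steps):
        # Shared context for this step: the state of every subsequence so far.
        context = [list(s) for s in subsequences]
        for seq in subsequences:
            if seq and seq[-1] == eos:
                continue  # this subsequence has already finished
            seq.append(predict_next(seq, context))
        if all(s and s[-1] == eos for s in subsequences):
            break
    return subsequences

# Toy usage: a dummy predictor that emits two tokens and then <eos>.
def toy_predict(prefix: List[str], context: List[List[str]]) -> str:
    return "<eos>" if len(prefix) >= 2 else f"tok{len(prefix)}"

result = pipelined_decode(toy_predict, num_subsequences=3, max_steps=10)
```

Because every subsequence emits one token per step, a batch-capable model can score all of them in a single forward pass, which is where the reported latency reduction would come from.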