🤖 AI Summary
Autoregressive decoding in large language models incurs high latency, and existing multi-token prediction methods rely on strong independence assumptions that limit modeling fidelity. Method: We propose Parallel Token Prediction (PTP), the first framework to internalize the sampling process into the model architecture. PTP jointly generates multiple semantically coherent tokens in a single Transformer forward pass while preserving full expressivity over any autoregressive distribution, thereby eliminating restrictive independence assumptions. It combines inverse autoregressive training with explicit modeling of the sampling procedure, and supports both teacher-free and distillation-based training. Results: Evaluated on Vicuna-7B, PTP achieves a state-of-the-art 4.12 average accepted tokens per speculative step on Spec-Bench while retaining full modeling capability for long-sequence generation.
📝 Abstract
We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single Transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
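For context on the "accepted tokens per step" metric: in speculative decoding, a drafter proposes several tokens at once and the target model verifies them with an accept/reject rule. The sketch below shows the standard speculative-sampling acceptance rule only; the function name and inputs are hypothetical, the residual resampling step on rejection is omitted, and this is not PTP's own architecture or procedure:

```python
import random

def speculative_accept(draft_tokens, p_target, p_draft, rng):
    """Standard speculative-sampling verification (illustrative sketch).

    Each drafted token is accepted with probability
    min(1, p_target / p_draft); the first rejection stops acceptance.
    (A full implementation would then resample one token from the
    residual distribution, which is omitted here.)
    """
    accepted = []
    for tok, pt, pd in zip(draft_tokens, p_target, p_draft):
        if rng.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break  # reject this token and all later drafted tokens
    return accepted

# If the target agrees with the draft, every token is accepted:
out = speculative_accept([5, 7, 9], [0.2, 0.3, 0.4], [0.2, 0.3, 0.4],
                         random.Random(0))
```

The average length of `accepted` per verification step is the quantity reported above (4.12 for PTP on Spec-Bench); longer accepted runs mean fewer target-model forward passes per generated token.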