🤖 AI Summary
Autoregressive decoding in large language models incurs high latency, and existing multi-token prediction methods rely on strong independence assumptions that limit modeling fidelity. Method: We propose Parallel Token Prediction (PTP), the first framework to internalize the sampling process into the model architecture. PTP jointly generates multiple semantically coherent tokens in a single Transformer forward pass while preserving full expressivity over any autoregressive distribution, thereby eliminating restrictive independence assumptions. It combines inverse autoregressive training with explicit modeling of the sampling procedure, and supports both teacher-free and distillation-based training. Results: Evaluated on Vicuna-7B, PTP achieves a state-of-the-art 4.12 average accepted tokens per speculative step on Spec-Bench while retaining full modeling capability for long-sequence generation.
📝 Abstract
We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single Transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
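For context on the "accepted tokens per step" metric: in speculative decoding, a drafter proposes several tokens at once and the target model verifies them with an accept/reject rule. The sketch below shows the standard speculative-sampling acceptance rule only; the function name and inputs are hypothetical, the residual resampling step on rejection is omitted, and this is not PTP's own architecture or procedure:

```python
import random

def speculative_accept(draft_tokens, p_target, p_draft, rng):
    """Standard speculative-sampling verification (illustrative sketch).

    Each drafted token is accepted with probability
    min(1, p_target / p_draft); the first rejection stops acceptance.
    (A full implementation would then resample one token from the
    residual distribution, which is omitted here.)
    """
    accepted = []
    for tok, pt, pd in zip(draft_tokens, p_target, p_draft):
        if rng.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break  # reject this token and all later drafted tokens
    return accepted

# If the target agrees with the draft, every token is accepted:
out = speculative_accept([5, 7, 9], [0.2, 0.3, 0.4], [0.2, 0.3, 0.4],
                         random.Random(0))
```

The average length of `accepted` per verification step is the quantity reported above (4.12 for PTP on Spec-Bench); longer accepted runs mean fewer target-model forward passes per generated token.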