AI Summary
Autoregressive language models suffer from low inference efficiency and limited parallelism due to their sequential, one-token-at-a-time generation, especially in semantically constrained later stages. To address this, we propose a novel speculative decoding framework that breaks token-level autoregressive dependency via a masked input paradigm, gated LoRA adaptation, a learnable multi-token sampling module, and a consistency-aware auxiliary loss, enabling parallel prediction of multiple future tokens. Built upon pretrained large language models, our method integrates LoRA fine-tuning, masked attention, and lightweight sequence modeling to support quadratic-scaling speculative generation. Experiments demonstrate up to 4.8× speedup on code and mathematical reasoning tasks, and 2.5× acceleration on general dialogue and knowledge-intensive tasks, with no degradation in generation quality (measured by BLEU, CodeBLEU, and MATH scores). Our core contribution is the first explicit modeling of implicit future-prediction capability as a controllable, consistent, and end-to-end trainable multi-step parallel generation mechanism.
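The gated LoRA adaptation mentioned above can be illustrated with a toy layer: a low-rank update is added to the frozen base projection only at gated positions (e.g. the inserted mask tokens), so ordinary tokens reproduce the original model exactly. This is a minimal NumPy sketch under that assumption; all names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def gated_lora_forward(x, W, A, B, gate):
    """Toy gated-LoRA layer: frozen base projection plus a low-rank
    update (B @ A) that is applied only where gate == 1, so positions
    with gate == 0 see the unmodified pretrained weights."""
    base = x @ W.T            # frozen pretrained projection
    delta = (x @ A.T) @ B.T   # low-rank LoRA update
    return base + gate[:, None] * delta

rng = np.random.default_rng(0)
d, r, n = 8, 2, 4                       # hidden dim, LoRA rank, sequence length
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))

gate = np.array([0.0, 0.0, 1.0, 1.0])   # last two positions are mask tokens
out = gated_lora_forward(x, W, A, B, gate)

# Positions with gate == 0 reproduce the base model exactly,
# which is how the original LLM's behavior is preserved.
assert np.allclose(out[:2], x[:2] @ W.T)
```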
Abstract
Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and semantics of text are relatively certain. In this work, we propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens, combining techniques to realize this potential and enable simultaneous prediction of multiple subsequent tokens. Our approach introduces several key innovations: (1) a masked-input formulation where multiple future tokens are jointly predicted from a common prefix; (2) a gated LoRA formulation that preserves the original LLM's functionality, while equipping it for multi-token prediction; (3) a lightweight, learnable sampler module that generates coherent sequences from the predicted future tokens; (4) a set of auxiliary training losses, including a consistency loss, to enhance the coherence and accuracy of jointly generated tokens; and (5) a speculative generation strategy that expands tokens quadratically in the future while maintaining high fidelity. Our method achieves significant speedups through supervised fine-tuning on pretrained models. For example, it generates code and math nearly 5× faster, and improves general chat and knowledge tasks by almost 2.5×. These gains come without any loss in quality.
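The speculative generation strategy in point (5) builds on the standard verify-and-accept step of speculative decoding: tokens drafted in parallel are kept only up to the first position where the base model disagrees, so quality is never degraded. A minimal pure-Python sketch of that acceptance rule follows; the helper name and token ids are illustrative, and the paper's quadratic expansion is more elaborate than this single-draft version.

```python
def accept_draft(draft, verified):
    """Keep the longest prefix of the parallel draft that matches the
    verifier's own next-token choices; on the first mismatch, emit the
    verifier's token instead and stop. If the whole draft matches, the
    verifier contributes one extra token for free."""
    accepted = []
    for d, v in zip(draft, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)   # verifier's token replaces the mismatch
            break
    else:
        # Entire draft accepted; append the verifier's bonus token if any.
        if len(verified) > len(draft):
            accepted.append(verified[len(draft)])
    return accepted

# Three of four drafted tokens match, so four tokens are emitted in one step
# instead of one token per step under plain autoregressive decoding.
assert accept_draft([7, 3, 9, 2], [7, 3, 9, 5]) == [7, 3, 9, 5]
# A fully accepted draft yields len(draft) + 1 tokens.
assert accept_draft([7, 3], [7, 3, 8]) == [7, 3, 8]
```

Because every emitted token is one the verifier model would itself have produced, the output distribution matches ordinary decoding, which is why the reported speedups come without quality loss.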