🤖 AI Summary
To address the weak long-range dependency modeling and low inference efficiency inherent in next-token prediction (NTP), the dominant paradigm for large language models (LLMs), this paper proposes Leap-based Multi-Token Prediction (L-MTP), which predicts non-contiguous, distant target tokens in a single forward pass rather than generating tokens strictly one at a time. Key contributions include: (i) a "leap" mechanism enabling non-adjacent token prediction; (ii) non-autoregressive multi-head leap attention; (iii) leap-aware positional encoding; and (iv) a decoding scheduling strategy tailored to leap-based generation. The authors theoretically establish a lower bound on the resulting inference acceleration and empirically report an average 2.1× speedup across multiple benchmarks, alongside reduced perplexity and improved downstream task accuracy, outperforming both standard NTP and conventional multi-token prediction (MTP).
📝 Abstract
Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction (L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code will be publicly available.
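To make the leap idea concrete, the sketch below contrasts the positions filled by conventional MTP heads (adjacent offsets 1, 2, ..., n) with those filled by leaping heads (offsets 1, 1+s, 1+2s, ...), and shows how consecutive forward passes can interleave to cover every position. This is a minimal illustration of the schedule described in the abstract, not the paper's implementation: the stride parameter `s`, the head count, and the frontier-advancing rule are assumptions introduced here for exposition.

```python
# Illustrative (hypothetical) leap decoding schedule for L-MTP.
# Positions are counted relative to the end of the current context.

def mtp_offsets(n_heads):
    """Adjacent offsets filled by conventional multi-token prediction."""
    return [h + 1 for h in range(n_heads)]

def lmtp_offsets(n_heads, stride):
    """Non-adjacent (leaping) offsets filled in one L-MTP forward pass."""
    return [1 + h * stride for h in range(n_heads)]

def contiguous_prefix(covered):
    """Length of the gap-free prefix 1..m already predicted."""
    m = 0
    while m + 1 in covered:
        m += 1
    return m

def lmtp_schedule(n_heads, stride, n_passes):
    """Simulate which positions are covered after several leap passes.

    Each pass fills its leap offsets from the current frontier; the
    frontier then advances to the end of the contiguous predicted
    prefix, so successive passes interleave and plug the gaps.
    """
    covered = set()
    frontier = 0
    for _ in range(n_passes):
        for off in lmtp_offsets(n_heads, stride):
            covered.add(frontier + off)
        frontier = contiguous_prefix(covered)
    return covered

# With 3 heads and stride 2, one pass reaches position 5 (vs. 3 for
# adjacent MTP), and two interleaved passes cover positions 1..6.
print(mtp_offsets(3))            # adjacent: [1, 2, 3]
print(lmtp_offsets(3, 2))        # leaping:  [1, 3, 5]
print(sorted(lmtp_schedule(3, 2, 2)))
```

Note the trade-off this toy schedule illustrates: a single leap pass looks further ahead (a longer draft horizon, useful for the paper's decoding strategy), while interleaved passes recover the skipped positions.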