Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) show limitations in long-horizon reasoning, planning, and creative writing, largely because the teacher-forced next-token prediction (NTP) objective struggles to capture long-range dependencies. Multi-token prediction (MTP) only marginally alleviates this short-horizon bias and yields limited gains. To address this, we propose Future Summary Prediction (FSP), a training objective that adds an auxiliary head to predict a compact representation of future sequence content, such as a bag-of-words summary or embeddings produced by a reverse (right-to-left) language model. By explicitly modeling long-range structure, FSP relaxes the short-horizon bias of conventional autoregressive pretraining while remaining efficient to train at the 3B- and 8B-parameter scale. Experiments show that FSP outperforms both NTP and MTP on long-horizon tasks, including mathematical reasoning and program synthesis, with better planning and long-range coherence. The authors position FSP as the first approach to incorporate abstracted future-state modeling into language-model pretraining.

📝 Abstract
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag-of-words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B- and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
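The handcrafted variant described in the abstract can be illustrated with a small sketch. The paper does not publish its exact target construction; the function below is a hypothetical illustration that, for each position, builds a normalized bag-of-words vector over the next `horizon` tokens, the kind of compact future summary an auxiliary head could be trained to predict (all names and the fixed-horizon choice are assumptions).

```python
from collections import Counter

def bow_future_targets(tokens, vocab_size, horizon):
    """Hypothetical sketch: for each position t, return a normalized
    bag-of-words vector over tokens[t+1 : t+1+horizon], serving as the
    auxiliary head's future-summary target."""
    targets = []
    for t in range(len(tokens)):
        window = tokens[t + 1 : t + 1 + horizon]
        counts = Counter(window)
        total = sum(counts.values())
        vec = [0.0] * vocab_size
        for tok, c in counts.items():
            vec[tok] = c / total
        targets.append(vec)
    return targets

# Example with a toy vocabulary of 5 token ids and a horizon of 3:
toks = [0, 2, 2, 4, 1]
tgt = bow_future_targets(toks, vocab_size=5, horizon=3)
# At t=0 the future window is [2, 2, 4], so the target puts
# weight 2/3 on token 2 and 1/3 on token 4.
```

Because the target discards word order, it summarizes *what* the future contains rather than *how* it unfolds, which is what makes it a compact long-range signal rather than a multi-token prediction.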
Problem

Research questions and friction points this paper is trying to address.

Improving long-horizon reasoning in language models
Enhancing planning and creative writing capabilities
Addressing limitations of next-token prediction methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts compact future summaries for training
Uses handcrafted and learned summary variants
Improves long-form reasoning over baseline methods
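The contributions above amount to adding an auxiliary objective alongside standard next-token training. The paper summary does not specify how the two losses are combined; a minimal sketch, assuming a simple weighted sum of the NTP loss and a mean-squared-error summary loss (the `lam` weight and all names are hypothetical), could look like:

```python
def fsp_training_loss(ntp_loss, summary_pred, summary_target, lam=0.5):
    """Hypothetical combined objective: standard next-token loss plus a
    lambda-weighted MSE between the auxiliary head's predicted future
    summary and the target summary vector."""
    mse = sum((p - q) ** 2 for p, q in zip(summary_pred, summary_target))
    mse /= len(summary_target)
    return ntp_loss + lam * mse

# Example: a perfect summary prediction adds nothing to the NTP loss.
loss = fsp_training_loss(2.0, [0.5, 0.5], [0.5, 0.5])  # → 2.0
```

Since the auxiliary head is only a training-time signal, it can be dropped at inference, leaving generation cost unchanged.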