🤖 AI Summary
Current large language models (LLMs) struggle with long-horizon reasoning, planning, and creative writing, limitations largely attributed to the teacher-forced next-token prediction (NTP) paradigm, which fails to capture long-range dependencies. Multi-token prediction (MTP) only marginally alleviates this short-horizon bias and yields limited gains. To address this, we propose Future Summary Prediction (FSP), a novel training objective that adds an auxiliary head to predict a compact representation of future sequence content, such as a bag-of-words summary or embeddings produced by a reverse language model. By explicitly modeling long-range structure that conventional autoregressive pretraining misses, FSP scales efficiently to 3B- and 8B-parameter models. Experiments demonstrate that FSP significantly outperforms both NTP and MTP on long-horizon tasks, including mathematical reasoning and program synthesis, yielding stronger planning capability and long-range coherence. To our knowledge, FSP is the first approach to incorporate abstracted future-state modeling into language modeling, thereby enhancing generative consistency and strategic planning.
📝 Abstract
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generation. We explore two variants of FSP: handcrafted summaries, for example a bag-of-words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B- and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
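The handcrafted-summary variant described above can be sketched in a few lines: at each position, the auxiliary target is a multi-hot bag-of-words vector over the tokens that appear later in the sequence, and the auxiliary head is trained against it with a binary cross-entropy loss. This is a minimal illustration under assumed details (the window choice, loss form, and all function names here are hypothetical, not the paper's implementation):

```python
import numpy as np

def bow_future_summary(tokens, t, vocab_size, horizon=None):
    """Multi-hot bag-of-words summary of the tokens after position t.

    Hypothetical helper for the handcrafted-summary variant of FSP:
    the auxiliary head at position t would be trained to predict which
    vocabulary items occur in the (optionally windowed) future.
    """
    future = tokens[t + 1:] if horizon is None else tokens[t + 1 : t + 1 + horizon]
    target = np.zeros(vocab_size, dtype=np.float32)
    target[np.unique(future)] = 1.0  # presence, not counts
    return target

def bce_aux_loss(logits, target):
    """Binary cross-entropy between the auxiliary head's logits and the
    multi-hot future summary (a sketch of the FSP auxiliary objective)."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    return float(-np.mean(target * np.log(probs + eps)
                          + (1.0 - target) * np.log(1.0 - probs + eps)))

# Example: for tokens [3, 1, 4, 1, 5] at position t=1, the future is
# [4, 1, 5], so the summary marks vocab ids 1, 4, and 5.
summary = bow_future_summary([3, 1, 4, 1, 5], t=1, vocab_size=8)
```

The learned-summary variant would replace this fixed multi-hot target with an embedding produced by a separately trained right-to-left language model, swapping the binary cross-entropy for a regression or contrastive loss on that embedding.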