Pretraining LLM with Latent Thoughts in Continuous Space

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited computational depth a language model can devote to each token during pretraining. The authors integrate a multi-step latent reasoning mechanism, operating in continuous state space, directly into the pretraining phase: before generating each token, the model performs several internal state-evolution steps, termed "latent thinking", rather than relying solely on post-hoc inference-time techniques such as chain-of-thought. This is the first approach to explicitly embed chain-of-thought-like extended computation into autoregressive pretraining, optimizing the language modeling objective end-to-end. Experiments on the Pythia architecture show that a 1.4B-parameter model with latent thinking outperforms a 2.8B-parameter baseline at equal inference cost, achieving superior performance on both language modeling and multiple downstream tasks. Moreover, increasing the number of latent thinking steps consistently improves results, demonstrating both the effectiveness and the scalability of deepening computation during pretraining.

📝 Abstract
The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pretraining methodology: Pretraining Language Models with Latent Thoughts. Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, ours-1.4B (Pythia Arch), pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token, forming a chain analogous to CoT, consistently improves the model's performance.
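The decoding step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the single-layer `step` cell stands in for a full transformer forward pass, and `W_back` (a projection mapping a latent thought back to the input space) and all other names are hypothetical. The only idea taken from the source is the control flow: run the ordinary pass on the real token, then feed the last hidden state back as the next "input" for a configurable number of latent-thought steps before emitting logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 16

emb = rng.normal(0.0, 0.1, (vocab, d))    # token embeddings
W_x = rng.normal(0.0, 0.1, (d, d))        # input projection
W_h = rng.normal(0.0, 0.1, (d, d))        # hidden-state projection
W_back = rng.normal(0.0, 0.1, (d, d))     # latent thought -> input space (assumed)
W_out = rng.normal(0.0, 0.1, (d, vocab))  # output head

def step(x, h):
    # Stand-in for one transformer forward pass at the current position.
    return np.tanh(x @ W_x + h @ W_h)

def predict_with_latent_thoughts(token_id, h, n_thoughts=1):
    # Ordinary pass consuming the real token...
    h = step(emb[token_id], h)
    # ...then n_thoughts extra "latent thinking" steps: the last hidden
    # state is projected back to the input space and consumed as input,
    # refining the prediction in continuous space before emitting logits.
    for _ in range(n_thoughts):
        h = step(h @ W_back, h)
    return h @ W_out, h  # logits for the next token, updated state

logits, h = predict_with_latent_thoughts(3, np.zeros(d), n_thoughts=2)
print(logits.shape)  # (16,)
```

In pretraining, the cross-entropy loss on the next token would be backpropagated through the latent steps as well, so the model learns end-to-end what to "think" before each prediction; varying `n_thoughts` corresponds to the chain of latent thoughts the abstract describes.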
Problem

Research questions and friction points this paper is trying to address.

Enhancing token prediction by pretraining with intermediate latent thoughts
Improving LM performance through additional computational steps during pretraining
Scaling latent thoughts analogous to Chain-of-Thought for better generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretraining with latent thoughts in continuous space
Generating intermediate hidden states before token prediction
Using latent thoughts chain to enhance model performance
Boyi Zeng
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University
He Li
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University
Shixiang Song
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai Innovation Institute
Yixuan Wang
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai Innovation Institute
Ziwei He
Shanghai Jiao Tong University
Xinbing Wang
Shanghai Jiao Tong University
Zhouhan Lin
Shanghai AI Laboratory, Shanghai Innovation Institute, LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University