Pretraining with Token-Level Adaptive Latent Chain-of-Thought

📅 2026-02-09
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
This work addresses the bottleneck in scaling large language models that arises from the scarcity of high-quality training data and rising communication costs. To overcome these limitations, the authors propose a token-level adaptive latent chain-of-thought mechanism that dynamically generates variable-length latent reasoning trajectories during single-stage general pretraining, without requiring explicit reasoning annotations, and adaptively allocates computation according to per-token difficulty. Built on the Llama architecture, the approach couples latent chain-of-thought generation with a token-level halting strategy, achieving lower language modeling perplexity at reduced training FLOPs while consistently improving accuracy across a range of downstream tasks.
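
To make the halting idea above concrete, here is a minimal sketch assuming an ACT-style rule in which each token accumulates halting probability across latent steps and stops once that mass crosses a threshold. The function name, the epsilon threshold, and the tensor layout are illustrative assumptions, not the authors' implementation.

import torch

def latent_steps_per_token(halt_probs: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    # halt_probs: (seq_len, max_steps) halting probability emitted at each
    # latent step for every token position (values in [0, 1]).
    # Returns (seq_len,): how many latent steps each token receives.
    cum = halt_probs.cumsum(dim=-1)                 # running halting mass per token
    steps = (cum < 1.0 - eps).sum(dim=-1) + 1       # steps before crossing, plus the halting step
    return steps.clamp(max=halt_probs.size(1))      # tokens that never halt use the full budget

# Example: an "easy" token halts after one latent step, a "hard" one uses all four.
probs = torch.tensor([[0.99, 0.00, 0.00, 0.01],
                      [0.10, 0.20, 0.30, 0.40]])
print(latent_steps_per_token(probs))                # tensor([1, 4])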

📝 Abstract
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token -- allocating longer trajectories to difficult tokens and shorter (or even zero) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.
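
As a rough illustration of the mechanism described in the abstract, the sketch below shows a toy PyTorch forward pass in which each token's hidden state is refined by a weight-shared latent block for a variable number of steps before the language-model head, with a learned per-token halting gate that lets easy tokens skip latent computation entirely. All module names, sizes, the MLP latent step (a real model would use transformer blocks with causal masking), and the hard halting rule are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn

class LatentCoTBlock(nn.Module):
    # Weight-shared latent step applied a variable number of times per token.
    def __init__(self, d_model: int = 256, max_latent_steps: int = 4, eps: float = 0.01):
        super().__init__()
        self.step = nn.Sequential(                    # stand-in for a shared transformer block
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.halt = nn.Linear(d_model, 1)             # per-token halting probability
        self.max_latent_steps = max_latent_steps
        self.eps = eps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model). Halting mass is computed before any latent
        # step, so easy tokens can take zero latent steps.
        cum = torch.sigmoid(self.halt(h)).squeeze(-1)
        for _ in range(self.max_latent_steps):
            active = (cum < 1.0 - self.eps).unsqueeze(-1)   # tokens still "thinking"
            if not active.any():
                break
            refined = self.step(h)                          # one shared latent step
            h = torch.where(active, refined, h)             # freeze tokens that have halted
            p = torch.sigmoid(self.halt(h)).squeeze(-1)
            cum = cum + p * active.squeeze(-1).float()      # accumulate halting mass
        return h

class TinyLatentCoTLM(nn.Module):
    # Minimal language model wrapping the adaptive latent block.
    def __init__(self, vocab_size: int = 32000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.latent = LatentCoTBlock(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens)            # tokens: (batch, seq)
        h = self.latent(h)                # variable per-token latent computation
        return self.lm_head(h)            # next-token logits for the usual LM loss

model = TinyLatentCoTLM()
logits = model(torch.randint(0, 32000, (2, 16)))
print(logits.shape)                       # torch.Size([2, 16, 32000])

A real implementation would likely train the halting gate with an auxiliary objective (for example an ACT-style ponder cost), since the hard threshold used here is not differentiable; the listing does not say which rule the authors actually use.
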
Problem

Research questions and friction points this paper is trying to address.

large language models
pretraining
computation efficiency
Chain-of-Thought
token-level adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive latent Chain-of-Thought
token-level computation
pretraining efficiency
variable-length reasoning
adaptive halting
🔎 Similar Papers
No similar papers found.

👥 Authors

Boyi Zeng
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University

Yiqin Hao
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University

He Li
Postdoc, Institute of Natural Science, Shanghai Jiao Tong University
active matter, fluid mechanics, pattern formation

Shixiang Song
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University

Feichen Song
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University

Zitong Wang
Sun Yat-sen University

Siyuan Huang
Shanghai Jiao Tong University
large language model

Yi Xu
Shanghai Jiao Tong University
Data Mining, Natural Language Processing, Knowledge Engineering

ZiWei He
Shanghai Innovation Institute

Xinbing Wang
Shanghai Jiao Tong University

Zhouhan Lin
LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University