🤖 AI Summary
This work addresses a key bottleneck in scaling large language models: high-quality training data is scarce and communication overhead keeps growing. Rather than adding parameters, the authors propose a token-level adaptive latent chain-of-thought (CoT) mechanism that dynamically generates variable-length latent reasoning trajectories during standard one-stage pretraining, without requiring explicit annotations, and adaptively allocates computation according to token difficulty. Built on the Llama architecture, the approach combines latent CoT generation with a token-level halting strategy, consistently lowering language modeling perplexity and improving accuracy across a range of downstream tasks while using fewer training FLOPs than prior recurrent baselines.
📝 Abstract
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token, allocating longer trajectories to difficult tokens and shorter (or even zero-length) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.
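To make the mechanism concrete, below is a minimal PyTorch-style sketch of token-wise adaptive halting over latent refinement steps, in the spirit of ACT-style adaptive computation. Everything here is an illustrative assumption rather than the paper's implementation: the module and names (`AdaptiveLatentCoT`, `halt_head`, `max_latent_steps`, the MLP step function) are hypothetical, and the paper applies the idea inside a Llama backbone.

```python
# Illustrative sketch (not the paper's implementation): token-level adaptive
# latent CoT via ACT-style halting. Names like `halt_head` and
# `max_latent_steps` are hypothetical.
import torch
import torch.nn as nn


class AdaptiveLatentCoT(nn.Module):
    """Refine each token's hidden state with a variable number of latent steps."""

    def __init__(self, d_model: int, max_latent_steps: int = 4, threshold: float = 0.99):
        super().__init__()
        # Shared "latent reasoning" step applied repeatedly per token.
        self.step_fn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.halt_head = nn.Linear(d_model, 1)  # per-token halting probability
        self.max_latent_steps = max_latent_steps
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states from the backbone.
        cum_halt = torch.zeros(h.shape[:-1], device=h.device)  # accumulated halt prob
        running = torch.ones_like(cum_halt, dtype=torch.bool)  # tokens still "thinking"
        out = torch.zeros_like(h)

        for _ in range(self.max_latent_steps):
            h = h + self.step_fn(h)                            # one latent CoT step
            p = torch.sigmoid(self.halt_head(h)).squeeze(-1)   # halt prob per token
            p = torch.where(running, p, torch.zeros_like(p))

            # Tokens crossing the threshold stop; their leftover mass goes to this step.
            halted_now = running & (cum_halt + p >= self.threshold)
            weight = torch.where(halted_now, 1.0 - cum_halt, p)
            out = out + weight.unsqueeze(-1) * h               # halting-weighted output
            cum_halt = cum_halt + weight
            running = running & ~halted_now
            if not running.any():                              # every token halted early
                break

        # Tokens that never crossed the threshold contribute their remaining mass.
        out = out + ((1.0 - cum_halt) * running.float()).unsqueeze(-1) * h
        return out


# Easy tokens halt after few steps; hard tokens use more, up to max_latent_steps.
layer = AdaptiveLatentCoT(d_model=512)
refined = layer(torch.randn(2, 16, 512))  # same shape as the input hidden states
```

In ACT-style setups a small ponder-cost term is usually added to the loss so the model does not default to the maximum number of steps; the abstract does not specify how the paper regularizes halting, so that detail is omitted here.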