🤖 AI Summary
This work addresses a key bottleneck in scaling large language models: high-quality training data is scarce and communication overhead keeps growing. Rather than adding parameters, the authors propose a token-level adaptive latent chain-of-thought (CoT) mechanism that dynamically generates variable-length latent reasoning trajectories during standard one-stage pretraining, without requiring explicit annotations, and adaptively allocates computation according to token difficulty. Built on the Llama architecture, the approach combines latent CoT generation with a token-level halting strategy, consistently lowering language modeling perplexity and improving accuracy across a range of downstream tasks while using fewer training FLOPs than prior recurrent baselines.
📝 Abstract
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token, allocating longer trajectories to difficult tokens and shorter (or even zero-length) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.
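To make the mechanism concrete, below is a minimal PyTorch-style sketch of token-wise adaptive halting over latent refinement steps, in the spirit of ACT-style adaptive computation. Everything here is an illustrative assumption rather than the paper's implementation: the module and names (`AdaptiveLatentCoT`, `halt_head`, `max_latent_steps`, the MLP step function) are hypothetical, and the paper applies the idea inside a Llama backbone.

```python
# Illustrative sketch (not the paper's implementation): token-level adaptive
# latent CoT via ACT-style halting. Names like `halt_head` and
# `max_latent_steps` are hypothetical.
import torch
import torch.nn as nn


class AdaptiveLatentCoT(nn.Module):
    """Refine each token's hidden state with a variable number of latent steps."""

    def __init__(self, d_model: int, max_latent_steps: int = 4, threshold: float = 0.99):
        super().__init__()
        # Shared "latent reasoning" step applied repeatedly per token.
        self.step_fn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.halt_head = nn.Linear(d_model, 1)  # per-token halting probability
        self.max_latent_steps = max_latent_steps
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states from the backbone.
        cum_halt = torch.zeros(h.shape[:-1], device=h.device)  # accumulated halt prob
        running = torch.ones_like(cum_halt, dtype=torch.bool)  # tokens still "thinking"
        out = torch.zeros_like(h)

        for _ in range(self.max_latent_steps):
            h = h + self.step_fn(h)                            # one latent CoT step
            p = torch.sigmoid(self.halt_head(h)).squeeze(-1)   # halt prob per token
            p = torch.where(running, p, torch.zeros_like(p))

            # Tokens crossing the threshold stop; their leftover mass goes to this step.
            halted_now = running & (cum_halt + p >= self.threshold)
            weight = torch.where(halted_now, 1.0 - cum_halt, p)
            out = out + weight.unsqueeze(-1) * h               # halting-weighted output
            cum_halt = cum_halt + weight
            running = running & ~halted_now
            if not running.any():                              # every token halted early
                break

        # Tokens that never crossed the threshold contribute their remaining mass.
        out = out + ((1.0 - cum_halt) * running.float()).unsqueeze(-1) * h
        return out


# Easy tokens halt after few steps; hard tokens use more, up to max_latent_steps.
layer = AdaptiveLatentCoT(d_model=512)
refined = layer(torch.randn(2, 16, 512))  # same shape as the input hidden states
```

In ACT-style setups a small ponder-cost term is usually added to the loss so the model does not default to the maximum number of steps; the abstract does not specify how the paper regularizes halting, so that detail is omitted here.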