Reasoning to Learn from Latent Thoughts

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
During large language model (LLM) pretraining, the scarcity and slow growth of high-quality human-written text hinder scalable data expansion. Method: This paper proposes explicitly modeling the latent reasoning processes that underlie text generation, treating raw web text as the compressed output of an implicit human thought process and recovering rich, context-aware, reasoning-intensive latent chains of thought (CoT) to improve data efficiency. It introduces latent thought modeling for data-constrained pretraining and designs an unsupervised EM-based bootstrapping framework that iteratively improves both the model's latent CoT generation capability and the quality of the synthetic training data, without strong supervision. The approach combines synthetic data generation, latent CoT distillation, and inference-time compute scaling. Results: On MATH, accuracy improves from 5.7% to 25.4% under identical data volume; a 1B-parameter model surpasses raw-data baselines after three EM iterations, with performance gains growing as E-step inference compute increases.
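The EM bootstrapping loop described above can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: `ToyLM`, `sample_thought`, `score`, and `finetune` are hypothetical stand-ins for a real language model's generation, scoring, and continued-pretraining steps.

```python
import random

class ToyLM:
    """Hypothetical stand-in for a trained LM (not the paper's code)."""
    def __init__(self, skill=0.0):
        self.skill = skill  # crude scalar proxy for model capability

    def sample_thought(self, text):
        # A real model would generate a latent chain of thought conditioned
        # on `text`; here we return a placeholder with a random quality score.
        return {"cot": f"latent thought for: {text}",
                "quality": random.random() + self.skill}

    def score(self, thought, text):
        # Proxy for how well a sampled thought explains the observed text.
        return thought["quality"]

    def finetune(self, pairs):
        # M-step stand-in: training on higher-quality (thought, text) pairs
        # is modeled as a capability bump proportional to average quality.
        avg_q = sum(self.score(z, x) for z, x in pairs) / len(pairs)
        return ToyLM(skill=self.skill + 0.1 * avg_q)

def em_bootstrap(model, corpus, iterations=3, samples_per_doc=4):
    """EM loop: infer latent thoughts (E-step), retrain on them (M-step)."""
    for _ in range(iterations):
        # E-step: sample several candidate thoughts per document and keep the
        # best-scoring one; raising samples_per_doc scales E-step inference
        # compute, which the paper reports yields increasing gains.
        augmented = []
        for x in corpus:
            candidates = [model.sample_thought(x)
                          for _ in range(samples_per_doc)]
            best = max(candidates, key=lambda z: model.score(z, x))
            augmented.append((best, x))
        # M-step: continue pretraining on the thought-augmented corpus.
        model = model.finetune(augmented)
    return model
```

The key structural point is that the same model plays both roles: it generates and scores latent thoughts in the E-step, then trains on the resulting thought-augmented data in the M-step, so capability and data quality improve together across iterations.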

📝 Abstract
Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process, and holds that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7% $\rightarrow$ 25.4% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
Problem

Research questions and friction points this paper is trying to address.

Addressing data bottleneck in LM pretraining scaling
Improving pretraining efficiency via latent thought modeling
Enhancing LM performance with synthetic thought-augmented data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modeling latent thoughts for data efficiency
Synthetic data infers latent reasoning steps
EM algorithm bootstraps LM performance iteratively