🤖 AI Summary
This work investigates pretraining language models exclusively on randomly generated data to obtain zero-shot in-context learning (ICL) capabilities. The authors propose “Iterative Random Computation” (IRC), a pretraining paradigm grounded theoretically in algorithmic complexity: building on results showing that sequence models can be trained to approximate Solomonoff induction, they argue that useful pretraining is possible without access to any real-world corpus. Empirically, models pretrained this way exhibit zero-shot ICL across a variety of datasets, and this performance improves consistently with model scale. When the approach is extended to real-world data, fine-tuning after IRC pretraining converges faster and generalizes better across datasets than training from scratch. The method thus offers a scalable, theoretically principled alternative to conventional data-dependent pretraining, particularly valuable in data-scarce regimes where collecting or curating high-quality corpora is infeasible.
📝 Abstract
We investigate the use of randomly generated data for pre-training a model. We justify this approach theoretically from the perspective of algorithmic complexity, building on recent research showing that sequence models can be trained to approximate Solomonoff induction. We derive similar but complementary theoretical results. We show empirically that synthetically generated data can be used to pre-train a model before any real data is seen. We replicate earlier results showing that models trained this way exhibit zero-shot in-context learning across a variety of datasets, and that this performance improves with scale. We extend these results to real-world data, showing that fine-tuning a model after such pre-training yields faster convergence and better generalization.
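To make the "randomly generated data" idea concrete, here is a minimal illustrative sketch of one way such synthetic pretraining sequences could be produced. The paper does not specify its generator here, so this is an assumption for illustration only: each batch is emitted by a freshly sampled random hidden-state process, so a model trained on many such batches must infer each source's statistics in context rather than memorize any fixed corpus. All function names (`sample_random_source`, `generate_sequence`) and parameters are hypothetical.

```python
import random


def sample_random_source(vocab_size=16, n_states=8, rng=None):
    """Sample a random hidden-state generator (illustrative, not the paper's
    actual method): each state gets a random emission distribution over
    tokens and a random transition distribution over next states."""
    rng = rng or random.Random()

    def dist(n):
        # Random probability vector of length n.
        w = [rng.random() for _ in range(n)]
        s = sum(w)
        return [x / s for x in w]

    emit = [dist(vocab_size) for _ in range(n_states)]   # token probs per state
    trans = [dist(n_states) for _ in range(n_states)]    # next-state probs per state
    return emit, trans


def generate_sequence(emit, trans, length=32, rng=None):
    """Run the sampled generator forward to emit one synthetic sequence."""
    rng = rng or random.Random()
    state = rng.randrange(len(emit))
    tokens = []
    for _ in range(length):
        tokens.append(rng.choices(range(len(emit[state])), weights=emit[state])[0])
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
    return tokens


# Each "document" comes from its own freshly sampled random source.
rng = random.Random(0)
emit, trans = sample_random_source(rng=rng)
batch = [generate_sequence(emit, trans, rng=rng) for _ in range(4)]
```

A generator of this kind never touches real data, which matches the abstract's claim that pre-training can happen "before any real data is seen"; richer instantiations (e.g. sampling and running random programs, in the spirit of Solomonoff induction) follow the same sample-then-emit pattern.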