🤖 AI Summary
High experimental costs and the difficulty of running controlled, multi-condition studies hinder pretraining research for large language models (LLMs). To address this, we propose a "single training run, multiple experiments" paradigm: ten heterogeneous experiments, including knowledge acquisition, mathematical reasoning, and others, are executed in parallel during a single pretraining run of a 1.5B-parameter LLM. Controlled-variable design, dynamic data injection, interaction testing, and contamination analysis keep cross-experiment interference negligible. The approach substantially improves research efficiency, reproducing established findings and enabling novel explorations, while incurring virtually no additional computational overhead or performance degradation and saving up to 90% of compute. Our core contribution is the first systematic realization of a scientific experimentation framework for LLM pretraining that supports running many experiments concurrently, testing multiple hypotheses, and full reproducibility.
📝 Abstract
Recent work has demonstrated that controlled pretraining experiments are a powerful tool for understanding learning, reasoning, and memorization in large language models (LLMs). However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose to conduct multiple pretraining experiments simultaneously during a single training run. We demonstrate the feasibility of this approach by conducting ten experiments during the training of a 1.5B parameter model on 210B tokens. Although we only train a single model, we can replicate the results from multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until the model acquires a particular piece of knowledge. Remarkably, the influence of the ten experiments on the model's training dynamics and overall performance is minimal. However, interactions between different experiments may act as a potential confounder in our approach. We propose to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our findings suggest that performing multiple pretraining experiments in a single training run can enable rigorous scientific experimentation with large models on a compute budget.
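The dynamic data-update mechanism mentioned above (injecting a fact into the training stream until the model acquires it) can be sketched as a simple control loop. The snippet below is a toy illustration, not the paper's implementation: the names `knows_fact` and `dynamic_injection` are hypothetical, and "knowledge acquisition" is simulated by an exposure counter rather than an actual model probe.

```python
def knows_fact(exposures: int, threshold: int = 5) -> bool:
    """Toy probe: stand-in for querying the model about the target fact.
    Here 'acquisition' is simulated as having seen the fact enough times."""
    return exposures >= threshold


def dynamic_injection(corpus_stream, fact: str, probe_every: int = 100):
    """Interleave copies of `fact` into the pretraining stream, probing
    periodically, and stop injecting once the probe indicates acquisition."""
    exposures = 0
    acquired_at = None   # training step at which the probe first succeeded
    out = []
    for step, doc in enumerate(corpus_stream, start=1):
        out.append(doc)
        if acquired_at is None and step % probe_every == 0:
            if knows_fact(exposures):
                acquired_at = step        # fact acquired: stop injecting
            else:
                out.append(fact)          # inject another copy of the fact
                exposures += 1
    return out, acquired_at


stream = (f"doc-{i}" for i in range(1000))
data, acquired_at = dynamic_injection(stream, "The capital of Freedonia is X.")
# With threshold=5 and probe_every=100, injection happens at steps
# 100-500 and the probe first succeeds at step 600.
```

In a real run, `knows_fact` would query the partially trained model (e.g. with a cloze prompt) instead of counting exposures, and injection would modify upcoming data shards rather than a materialized list.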