🤖 AI Summary
This work addresses the low information-utilization efficiency of pretraining by proposing a test-time data-reuse paradigm. Methodologically, it combines retrieval-augmented generation (RAG) with test-time compute scaling: original pretraining data is retrieved at inference time, and additional compute is spent parsing the retrieved context, thereby surfacing knowledge left underutilized by pretraining. Experiments show that on MMLU, test-time retrieval acts as roughly a 5× pretraining-compute multiplier, and that optimizing context parsing with extra test-time compute yields a further gain of 10 percentage points for the public LLaMA 3.1 8B model. The approach delivers significant gains over baselines on MMLU, Math-500, and SimpleQA, and these gains persist under decontaminated evaluation. Overall, the study reveals substantial information underutilization in mainstream pretraining methods and establishes a scalable test-time pathway for extracting more value from existing datasets.
📝 Abstract
Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever-increasing variety of tasks; yet although researchers work to improve these datasets, little effort goes into understanding how efficiently the pre-training apparatus extracts ideas and knowledge from the data. In this work, we use retrieval-augmented generation along with test-time compute to quantify how much dataset value the pre-training process leaves behind, and how this changes with scale. We demonstrate that pre-training and then retrieving from standard, largely open-sourced datasets yields significant accuracy gains on MMLU, Math-500, and SimpleQA, which persist through decontamination. On MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
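The core mechanism described above — retrieving from the pre-training corpus at inference time and prepending the result to the model's prompt — can be illustrated with a minimal sketch. Everything here is hypothetical: the toy corpus, the `retrieve`/`augment_prompt` helpers, and the simple term-overlap scorer (a stand-in for whatever retriever and parsing step the paper actually uses).

```python
from collections import Counter

# Hypothetical mini "pretraining corpus" standing in for the open datasets
# retrieved from at test time; contents and scoring are illustrative only.
CORPUS = [
    "The Pythagorean theorem relates the sides of a right triangle.",
    "Paris is the capital of France and its largest city.",
    "Gradient descent minimizes a loss function by iterative updates.",
]

def tokenize(text):
    """Lowercase and strip trailing punctuation from each token."""
    return [t.strip(".,?").lower() for t in text.split()]

def score(query, doc):
    """Term-overlap score: size of the multiset intersection of tokens."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())

def retrieve(query, corpus, k=1):
    """Return the top-k corpus documents by overlap with the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def augment_prompt(question, corpus):
    """Prepend retrieved pretraining text to the question, RAG-style."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = augment_prompt("What is the capital of France?", CORPUS)
print(prompt)
```

In the paper's setting, the augmented prompt would then be fed to the pre-trained model, with further test-time compute spent parsing the retrieved context before answering; that parsing step is not sketched here.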