OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

📅 2026-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the exhaustion of high-quality pretraining data, often termed the "data wall", by proposing OPUS, a dynamic data selection framework that overcomes the limitations of static or optimizer-agnostic strategies. OPUS integrates modern optimizer update mechanisms into data utility estimation by projecting optimizer-induced effective updates onto a target task direction. Leveraging the Ghost gradient approximation, CountSketch compression, and Boltzmann sampling, OPUS achieves efficient, scalable, and diverse data selection with only 4.7% additional computational overhead. Experiments show that GPT-2 Large/XL models trained with OPUS on just 30B tokens surpass industrial-scale baselines and even full 200B-token training. Furthermore, in continued pretraining of Qwen3-8B-Base on scientific corpora, OPUS outperforms full training on 3B tokens while using only 0.5B tokens.
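The summary mentions CountSketch compression as one ingredient of OPUS's efficiency. As a rough illustration of why sketching helps here, the sketch below shows a minimal CountSketch in NumPy: each coordinate is hashed to one of `width` buckets with a random sign, and inner products between sketched vectors approximate inner products between the originals, so projections of compressed gradients estimate projections of the full gradients. The function name, dimensions, and shared-seed hashing are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def countsketch(vec, width, seed=0):
    """Compress a vector with CountSketch: hash each index to one of
    `width` buckets and accumulate the value with a random +/-1 sign.
    Sketches built with the same seed share the same hash functions."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=vec.shape[0])
    signs = rng.choice([-1.0, 1.0], size=vec.shape[0])
    sketch = np.zeros(width)
    np.add.at(sketch, buckets, signs * vec)  # scatter-add into buckets
    return sketch

# Inner products are approximately preserved: the inner product of two
# sketches (same seed!) estimates the inner product of the full vectors.
g = np.random.default_rng(1).normal(size=10_000)          # stand-in gradient
t = g + 0.1 * np.random.default_rng(2).normal(size=10_000)  # correlated target
approx = countsketch(g, 4096, seed=3) @ countsketch(t, 4096, seed=3)
exact = g @ t
```

With `width` much smaller than the original dimension, this keeps per-candidate scoring memory and compute low while preserving the projection values the selection rule needs.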

📝 Abstract
As high-quality public text approaches exhaustion (a phenomenon known as the Data Wall), pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even on lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared with full training on 3B tokens, demonstrating significant data-efficiency gains in specialized domains.
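The abstract describes two steps that can be sketched concretely: scoring each candidate by projecting its update onto a target direction, then sampling by Boltzmann weights rather than taking the top-k, which trades utility against diversity. The sketch below uses random vectors as stand-ins for optimizer-shaped updates and a proxy-derived target; all names, dimensions, and the temperature value are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_select(utilities, k, temperature=1.0):
    """Sample k distinct candidates with probability proportional to
    exp(utility / temperature). Lower temperature concentrates on the
    highest-utility candidates; higher temperature increases diversity."""
    u = np.asarray(utilities, dtype=float)
    logits = (u - u.max()) / temperature   # shift by max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(u), size=k, replace=False, p=probs)

# Utility of a candidate = projection of its (effective) update onto a
# unit target direction derived from an in-distribution proxy.
target = rng.normal(size=128)
updates = rng.normal(size=(1000, 128))     # stand-in for optimizer-shaped updates
utilities = updates @ (target / np.linalg.norm(target))

chosen = boltzmann_select(utilities, k=32, temperature=0.5)
```

A plain top-k rule would be the zero-temperature limit of this sampler; the stochastic version keeps some lower-scoring but distinct candidates in the batch.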
Problem

Research questions and friction points this paper is trying to address.

data selection
large language model
pre-training
data efficiency
training dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic data selection
optimizer-induced utility
projected gradient
efficient pre-training
CountSketch