Language Models Improve When Pretraining Data Matches Target Tasks

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the misalignment between pretraining data selection and downstream tasks. The authors propose BETR (benchmark-targeted ranking), a method that explicitly aligns data filtering with target benchmarks: it measures document-task relevance via embedding-space similarity, then trains a lightweight classifier to score the entire corpus efficiently. A key finding is that larger models require less aggressive data filtering, revealing a relationship between data curation and model scale. BETR generalizes across tasks and scales: it outperforms baselines on 9 of 10 downstream tasks, achieves a 2.1× compute multiplier over DCLM-Baseline, and a 4.7× multiplier over unfiltered data.
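The compute-multiplier figures come from fitting scaling laws to models trained at many budgets, then asking how much more compute one data recipe needs to match another's loss. A minimal sketch of that calculation, with made-up loss curves (the exponent, coefficients, and data points below are illustrative, not from the paper):

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares in log space."""
    X = np.vstack([np.ones_like(compute), np.log(compute)]).T
    coef, *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)
    log_a, neg_b = coef
    return np.exp(log_a), -neg_b

def compute_to_reach(a, b, target_loss):
    """Invert loss = a * C**(-b) to find the compute C reaching target_loss."""
    return (a / target_loss) ** (1.0 / b)

# Synthetic points: the "filtered" recipe behaves like 2.1x more compute.
C = np.array([1e19, 1e20, 1e21, 1e22])
loss_base = 10.0 * C ** -0.05
loss_filtered = 10.0 * (2.1 * C) ** -0.05

a0, b0 = fit_power_law(C, loss_base)
a1, b1 = fit_power_law(C, loss_filtered)

# Compute multiplier at matched loss: baseline compute / filtered compute.
target = loss_base[-1]
multiplier = compute_to_reach(a0, b0, target) / compute_to_reach(a1, b1, target)
```

Because both synthetic curves share the same exponent, the recovered multiplier is exactly the 2.1× built into the data; on real runs the fitted exponents differ and the multiplier depends on where the curves are compared.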

📝 Abstract
Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
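The selection pipeline described in the abstract — embed benchmarks and a document sample, score by similarity, distill into a cheap classifier — can be sketched in a few lines. Everything concrete here is assumed for illustration: the embeddings are random stand-ins, similarity is aggregated with a max over benchmark examples (the abstract does not pin this down), and a linear probe stands in for the lightweight classifier:

```python
import numpy as np

def betr_scores(benchmark_embs, doc_embs):
    """Score each document by cosine similarity to its closest benchmark example."""
    b = benchmark_embs / np.linalg.norm(benchmark_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return (d @ b.T).max(axis=1)  # shape: (num_docs,)

def select_top_fraction(scores, fraction):
    """Return indices of the top `fraction` of documents by score."""
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
bench_embs = rng.normal(size=(16, 32))   # embedded benchmark training examples
doc_embs = rng.normal(size=(1000, 32))   # embedded sample of pretraining documents

scores = betr_scores(bench_embs, doc_embs)
kept = select_top_fraction(scores, fraction=0.1)

# The scores are then distilled into a lightweight model so the full corpus
# can be scored cheaply; a linear probe on the embeddings is one stand-in.
w, *_ = np.linalg.lstsq(doc_embs, scores, rcond=None)
predicted_scores = doc_embs @ w
```

The point of the sketch is only the score-then-distill structure: expensive similarity scoring on a sample, followed by a cheap predictor applied to the full corpus.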
Problem

Research questions and friction points this paper is trying to address.

Aligning pretraining data improves language model performance
Benchmark-targeted ranking selects data matching evaluation tasks
Optimal data filtering varies with model scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

BETR ranks pretraining data by benchmark similarity
Lightweight classifier predicts document relevance scores
Larger models need less aggressive data filtering