Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) for language models suffers from a data bottleneck: existing RL datasets are small-scale and lack domain diversity, paling in comparison to the vast, heterogeneous corpora used in pretraining. Method: The authors introduce a scalable, web-scale RL data engine that automatically transforms massive general-purpose pretraining text into high-quality, multi-domain, verifiable question-answer pairs for RL training. The engine couples domain-diverse question generation with answer consistency verification to ensure fidelity and coverage. Contribution/Results: It yields Webscale-RL, a dataset of 1.2 million QA pairs spanning more than nine domains. Experiments show that RL fine-tuning on Webscale-RL matches the performance of continual pretraining with up to 100× fewer tokens, substantially narrows the training-generation gap, and outperforms strong data refinement baselines across multiple benchmarks.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
Problem

Research questions and friction points this paper is trying to address.

Bridging the training-generation gap in LLMs through reinforcement learning
Overcoming the data bottleneck for RL with diverse web-scale datasets
Creating scalable RL training pipelines for efficient language model development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline converts pretraining documents to RL data
Generates verifiable question-answer pairs automatically
Enables efficient RL training with fewer tokens
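The innovation bullets above describe a two-stage flow: generate candidate QA pairs from each pretraining document, then keep only pairs whose answer can be independently re-verified against the source. A minimal sketch of that flow is below; it is not the authors' code, and the stub `generate_qa` / `verify_consistency` functions are placeholders for the LLM-based generation and consistency-checking steps the paper describes.

```python
# Hypothetical sketch of the Webscale-RL data pipeline (assumption: the real
# system uses LLM prompting for both stages; simple stubs are used here).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    domain: str

def generate_qa(document: str, domain: str) -> list[QAPair]:
    """Stand-in for domain-diverse question generation: in the paper this
    would prompt an LLM; here we extract one trivial factual pair."""
    first_sentence = document.split(".")[0].strip()
    if not first_sentence:
        return []
    return [QAPair("What does the document state first?", first_sentence, domain)]

def verify_consistency(pair: QAPair, document: str) -> bool:
    """Stand-in for answer consistency verification: an independent answerer
    must reproduce the answer from the source text; here we check grounding."""
    return pair.answer in document

def build_rl_dataset(documents: list[tuple[str, str]]) -> list[QAPair]:
    """Run generation then verification, keeping only verifiable pairs."""
    verified: list[QAPair] = []
    for doc, domain in documents:
        for pair in generate_qa(doc, domain):
            if verify_consistency(pair, doc):
                verified.append(pair)
    return verified
```

At web scale this loop would be distributed across the pretraining corpus, with the verification stage acting as the quality filter that makes the resulting QA pairs usable as verifiable RL rewards.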