PretrainZero: Reinforcement Active Pretraining

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) has long been confined to domain-specific post-training, limiting its application to general-purpose reasoning during pretraining. Method: We propose PretrainZero, an RL-driven pretraining paradigm that operates on large-scale unlabeled corpora (e.g., Wikipedia) without human annotations or reward models. Its core innovation is a self-supervised reasoning policy that actively identifies high-information, challenging masked fragments, enabling end-to-end RL pretraining of foundation models (3B–30B parameters). Contribution/Results: Applied to Qwen3-4B-Base, PretrainZero yields substantial gains: +8.43 on MMLU-Pro, +5.96 on SuperGPQA, and +10.60 on mathematical benchmarks, demonstrating markedly improved generalization and reasoning. The resulting model also serves as a strong reasoning backbone for downstream reinforcement learning with verifiable rewards (RLVR) tasks.

📝 Abstract
Mimicking human behavior to actively learn from general experience, and thereby achieve artificial general intelligence, has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities in specific domains such as software and math, but still rely heavily on verifiable rewards, creating a significant bottleneck for extending the boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus that extends RL from domain-specific post-training to general pretraining. PretrainZero has the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy that actively identifies reasonable and informative content in the pretraining corpus and learns, via RL, to predict that content. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and math-average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
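The core loop the abstract describes (pick an informative span, mask it, reward the model for recovering it) can be sketched in a few lines. The paper's actual span-selection policy and reward are not specified here, so the entropy-style scoring and the token-level F1 reward below are illustrative assumptions, not the authors' implementation; `select_span` and `span_reward` are hypothetical names.

```python
def select_span(tokens, scores, span_len=3):
    """Pick the contiguous span with the highest total informativeness
    score (a stand-in for the learned selection policy; scores could be
    per-token surprisal under the current model)."""
    best_start, best_score = 0, float("-inf")
    for i in range(len(tokens) - span_len + 1):
        s = sum(scores[i:i + span_len])
        if s > best_score:
            best_start, best_score = i, s
    return best_start, best_start + span_len

def span_reward(prediction, target):
    """Self-supervised reward: token-level F1 between the model's
    prediction and the original masked text, so no reward model or
    human label is needed."""
    pred, tgt = prediction.split(), target.split()
    if not pred or not tgt:
        return 0.0
    common = len(set(pred) & set(tgt))
    if common == 0:
        return 0.0
    p, r = common / len(pred), common / len(tgt)
    return 2 * p * r / (p + r)

tokens = "the capital of france is paris".split()
scores = [0.1, 0.2, 0.1, 0.9, 0.3, 0.8]  # e.g., per-token surprisal
start, end = select_span(tokens, scores)
masked = tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]
target = " ".join(tokens[start:end])
print(masked)                                  # span with highest score is masked
print(span_reward("france is paris", target))  # exact recovery -> reward 1.0
```

The key property, per the abstract, is that the reward comes directly from the corpus itself: the original text of the masked span is the verification signal, which is what lets RL run on general pretraining data without a verifier.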
Problem

Research questions and friction points this paper addresses.

Extends RL from domain-specific post-training to general pretraining without verifiable rewards
Enables self-supervised pretraining on general corpus, breaking verification data-wall
Enhances general reasoning by actively selecting and predicting informative content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active pretraining identifies informative content from corpus
Self-supervised learning breaks verification data-wall using RL
Verification scaling enhances reasoning via challenging masked spans
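"Verification scaling" as described above means the masked spans get progressively harder as the model improves. One simple way to realize that is a reward-driven curriculum over span length; the schedule below is a sketch under assumed thresholds, not the paper's actual mechanism, and `update_span_len` is a hypothetical helper.

```python
def update_span_len(span_len, avg_reward, target=0.7,
                    max_len=32, min_len=1):
    """Grow the masked span when prediction gets too easy (high average
    reward), shrink it when it gets too hard, keeping the task near a
    target difficulty. Thresholds are illustrative assumptions."""
    if avg_reward > target:
        span_len = min(span_len + 1, max_len)
    elif avg_reward < target - 0.3:
        span_len = max(span_len - 1, min_len)
    return span_len

# As training rewards stay high, the curriculum lengthens the span;
# once rewards collapse, it backs off.
span_len = 4
for avg_reward in [0.9, 0.8, 0.75, 0.5, 0.2]:
    span_len = update_span_len(span_len, avg_reward)
    print(avg_reward, "->", span_len)
```

A curriculum like this keeps the self-supervised reward informative: spans that are always recovered give no learning signal, and spans that are never recovered give none either.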