🤖 AI Summary
Reinforcement learning (RL) has long been confined to domain-specific post-training, hindering its application to general-purpose reasoning pretraining. Method: We propose PretrainZero—a novel RL-driven pretraining paradigm that operates on large-scale, unlabeled corpora (e.g., Wikipedia) without human annotations or reward models. Its core innovation is a self-supervised reasoning policy that actively identifies high-information, challenging masked fragments, enabling end-to-end RL pretraining of foundation models (3B–30B parameters). Contribution/Results: Applied to Qwen3-4B-Base, PretrainZero yields substantial gains: +8.43 on MMLU-Pro, +5.96 on SuperGPQA, and +10.60 on mathematical benchmarks—demonstrating markedly improved generalization and reasoning capabilities. Moreover, the resulting model serves as a strong reasoning backbone for downstream reinforcement learning with verifiable rewards (RLVR) tasks.
📝 Abstract
Mimicking the human ability to actively learn from general experience and thereby achieve artificial general intelligence has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck that limits the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus that extends RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy that actively identifies reasonable and informative content in the pretraining corpus and reasons to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, substantially breaking through the verification data wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and the math benchmark average, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
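To make the self-supervised setup concrete, here is a minimal, hypothetical sketch of the masked-span objective the abstract describes: a span of the corpus is hidden, the model must reconstruct it, and the reward is checked against the original text itself, so no external labels or reward model are needed. The function names, the random span selection, and the binary exact-match reward are illustrative assumptions; in the paper, a learned policy chooses which spans are informative, not a random sampler.

```python
import random

def mask_span(text: str, span_len: int = 3, seed: int = 0):
    """Hide one contiguous span of words; return the masked text and the answer.

    NOTE: random selection is a placeholder. PretrainZero instead learns a
    policy that actively picks informative, challenging spans.
    """
    rng = random.Random(seed)
    words = text.split()
    start = rng.randrange(0, len(words) - span_len + 1)
    answer = " ".join(words[start:start + span_len])
    masked = words[:start] + ["<mask>"] + words[start + span_len:]
    return " ".join(masked), answer

def span_reward(prediction: str, answer: str) -> float:
    """Binary self-supervised reward: the corpus itself is the verifier.

    Exact match is a simplification; softer matching could be substituted.
    """
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0

text = "reinforcement learning can pretrain general reasoning models on unlabeled text"
masked, answer = mask_span(text)
# The RL loop would sample a model prediction for <mask> and score it:
reward = span_reward(answer, answer)  # a correct reconstruction earns 1.0
```

The key point this sketch illustrates is that the reward signal comes entirely from the pretraining corpus: any document can be turned into a verifiable prediction task, which is what lets RL scale beyond hand-curated, domain-specific reward data.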