Reinforcement Pre-Training

📅 2025-06-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Conventional language modeling relies on next-token prediction without explicit reasoning mechanisms or verifiable feedback, which limits its ability to learn robust, interpretable representations. Method: We propose Reinforcement Pre-Training (RPT), a paradigm that reformulates standard language modeling as a token-level, verifiable, reward-driven sequential decision-making task. RPT performs end-to-end, general-purpose reinforcement learning pre-training on massive unlabeled corpora, without human annotations or task-specific supervision. Its core innovation is a computationally efficient token-level reward based on local semantic consistency, optimized with policy gradient methods for scalable training. Results: Experiments show that RPT substantially improves next-token prediction accuracy, with performance scaling consistently as training compute increases. The resulting models not only achieve stronger language modeling but also provide a high-quality initialization for downstream alignment methods such as RLHF. RPT establishes a scalable, verifiable, general-purpose pre-training framework grounded in reinforcement learning principles.
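The verifiable token-level reward at the heart of this setup can be sketched as follows. This is a minimal illustration assuming an exact-match reward against the corpus token; the function names are hypothetical, and the paper's actual reward additionally incorporates local semantic consistency:

```python
def next_token_reward(predicted_token: str, ground_truth_token: str) -> float:
    """Verifiable token-level reward: 1.0 for an exact match with the
    corpus token, 0.0 otherwise. No human annotation is needed, since
    the ground truth is simply the next token in the unlabeled text."""
    return 1.0 if predicted_token == ground_truth_token else 0.0

def score_sequence(predictions: list[str], corpus_tokens: list[str]) -> list[float]:
    """Score each position's prediction against the token that actually
    follows it in the corpus, yielding a dense per-token reward signal."""
    return [next_token_reward(p, t) for p, t in zip(predictions, corpus_tokens)]
```

Because the reward is derived directly from the raw text, any unlabeled corpus becomes RL training data, which is what makes the approach general-purpose and scalable.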

๐Ÿ“ Abstract
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
Problem

Research questions and friction points this paper is trying to address.

Improving next-token prediction using reinforcement learning
Leveraging text data for general-purpose reinforcement learning
Enhancing language model pre-training with scalable RL methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Pre-Training for language models
Next-token prediction as RL reasoning task
Scalable RL using general text data
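As a pedagogical sketch of how a policy-gradient method can optimize the verifiable next-token reward, the toy REINFORCE update below trains a categorical distribution over a small vocabulary toward the corpus token. This is a pure-Python illustration under simplifying assumptions (a bandit-style single-step policy, 0/1 exact-match reward), not the paper's actual training loop:

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits: list[float], target_idx: int, lr: float = 0.5) -> list[float]:
    """One REINFORCE update on a toy categorical policy over the vocabulary:
    sample a token, receive the verifiable 0/1 reward (match with the corpus
    token at target_idx), and move the logits along reward * grad log pi(a).
    The gradient of log pi(a) w.r.t. logit i is (1[i == a] - p_i)."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if action == target_idx else 0.0
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
```

Repeated updates concentrate probability mass on the ground-truth token; in spirit, this is how a verifiable reward derived from raw text can shape next-token predictions without any annotated answers.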