Reinforcement Pre-Training

📅 2025-06-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Conventional language modeling relies on next-token prediction without explicit reasoning mechanisms or verifiable feedback, which limits its ability to learn robust, interpretable representations. Method: We propose Reinforcement Pre-Training (RPT), a paradigm that reformulates standard language modeling as a token-level, verifiable, reward-driven sequential decision-making task. RPT performs end-to-end, general-purpose reinforcement learning pre-training on massive unlabeled corpora, without human annotations or task-specific supervision. Its core innovation is a computationally efficient token-level reward based on local semantic consistency, optimized with policy gradient methods for scalable training. Results: Experiments show that RPT substantially improves next-token prediction accuracy, with performance scaling consistently as training compute increases. The resulting models not only achieve stronger language modeling but also provide a high-quality initialization for downstream alignment methods such as RLHF. RPT establishes a scalable, verifiable, general-purpose pre-training framework grounded in reinforcement learning principles.
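The verifiable token-level reward at the heart of this setup can be sketched as follows. This is a minimal illustration assuming an exact-match reward against the corpus token; the function names are hypothetical, and the paper's actual reward additionally incorporates local semantic consistency:

```python
def next_token_reward(predicted_token: str, ground_truth_token: str) -> float:
    """Verifiable token-level reward: 1.0 for an exact match with the
    corpus token, 0.0 otherwise. No human annotation is needed, since
    the ground truth is simply the next token in the unlabeled text."""
    return 1.0 if predicted_token == ground_truth_token else 0.0

def score_sequence(predictions: list[str], corpus_tokens: list[str]) -> list[float]:
    """Score each position's prediction against the token that actually
    follows it in the corpus, yielding a dense per-token reward signal."""
    return [next_token_reward(p, t) for p, t in zip(predictions, corpus_tokens)]
```

Because the reward is derived directly from the raw text, any unlabeled corpus becomes RL training data, which is what makes the approach general-purpose and scalable.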

๐Ÿ“ Abstract
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
Problem

Research questions and friction points this paper is trying to address.

Improving next-token prediction using reinforcement learning
Leveraging text data for general-purpose reinforcement learning
Enhancing language model pre-training with scalable RL methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Pre-Training for language models
Next-token prediction as RL reasoning task
Scalable RL using general text data
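As a pedagogical sketch of how a policy-gradient method can optimize the verifiable next-token reward, the toy REINFORCE update below trains a categorical distribution over a small vocabulary toward the corpus token. This is a pure-Python illustration under simplifying assumptions (a bandit-style single-step policy, 0/1 exact-match reward), not the paper's actual training loop:

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits: list[float], target_idx: int, lr: float = 0.5) -> list[float]:
    """One REINFORCE update on a toy categorical policy over the vocabulary:
    sample a token, receive the verifiable 0/1 reward (match with the corpus
    token at target_idx), and move the logits along reward * grad log pi(a).
    The gradient of log pi(a) w.r.t. logit i is (1[i == a] - p_i)."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if action == target_idx else 0.0
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
```

Repeated updates concentrate probability mass on the ground-truth token; in spirit, this is how a verifiable reward derived from raw text can shape next-token predictions without any annotated answers.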