SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

📅 2025-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reproducing RL-driven reasoning gains across domains remains difficult, which the authors attribute to limited methodological transparency. To address this, the paper proposes two-Staged history-Resampling Policy Optimization (SRPO), a pure reinforcement learning (RL) approach that improves both the mathematical and coding reasoning of Qwen2.5-32B without prior supervised fine-tuning (SFT). Built on Group Relative Policy Optimization (GRPO), it contributes: (1) a two-stage cross-domain training paradigm that balances mathematical reasoning and coding proficiency; and (2) History Resampling (HR), a technique for handling ineffective training samples. On AIME24 and LiveCodeBench, SRPO surpasses DeepSeek-R1-Zero-32B while using the same base model and relying solely on RL.

📝 Abstract

Recent advances in reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this using the same base model as DeepSeek (i.e., Qwen2.5-32B) and relies solely on RL, without prior Supervised Fine-Tuning (SFT). Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, aiming to offer valuable insights into scaling LLM reasoning capabilities across diverse tasks.
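As a rough illustration of the GRPO building block the abstract refers to, the sketch below computes group-relative advantages: several responses are sampled for one prompt, each is scored, and every reward is normalized by its group's mean and standard deviation. The function name, the `eps` stabilizer, the use of the population standard deviation, and the 0/1 rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (one prompt, several rollouts).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ on both points.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0/1 correctness rewards for four rollouts of the same prompt.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this sketch, a group whose rollouts all receive the same reward gets all-zero advantages and so contributes no policy-gradient signal. That is one plausible reading of the "ineffective samples" the abstract mentions.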

Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning with cross-domain RL

Overcoming replication challenges without SFT

Balancing math and coding via two-stage training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage cross-domain training paradigm

History Resampling technique

Pure RL without SFT
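The summary describes History Resampling only as a technique for addressing ineffective samples. One possible reading, assumed here rather than confirmed by the text above, is that prompts the policy already solves on every rollout are dropped before the next epoch, since they no longer carry a learning signal. The function name, the `history` structure, and the `solved_reward` threshold in this minimal sketch are all hypothetical.

```python
def resample_from_history(history, solved_reward=1.0):
    """Return the prompt ids to keep for the next epoch.

    history maps a prompt id to the rewards its rollouts earned last epoch.
    Assumption: a prompt counts as ineffective when every rollout already
    earned solved_reward; such prompts are dropped and all others are kept.
    """
    return [
        prompt_id
        for prompt_id, rewards in history.items()
        if not all(r == solved_reward for r in rewards)
    ]


# Hypothetical rollout rewards recorded during the previous epoch.
history = {
    "easy": [1.0, 1.0, 1.0, 1.0],
    "mixed": [1.0, 0.0, 1.0, 0.0],
    "hard": [0.0, 0.0, 0.0, 0.0],
}
kept = resample_from_history(history)
```

Under these assumptions, `kept` retains the prompts with mixed or all-failed rollouts and discards the one that was already solved every time.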
