SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

📅 2025-04-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the difficulty of replicating RL-driven reasoning gains across diverse domains, this paper proposes two-Staged history-Resampling Policy Optimization (SRPO), a pure reinforcement learning (RL) approach that improves the joint mathematical and coding reasoning of Qwen2.5-32B without supervised fine-tuning (SFT). Key contributions include: (1) a two-stage cross-domain training paradigm that balances the development of mathematical reasoning and coding proficiency; (2) History Resampling (HR), a technique that filters out ineffective samples during training; and (3) scalable RL training built upon the Group Relative Policy Optimization (GRPO) framework. On AIME24 and LiveCodeBench, SRPO surpasses DeepSeek-R1-Zero-32B using the same base model, demonstrating strong, purely RL-driven cross-domain reasoning capability.

📝 Abstract
Recent advances in reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which successfully surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this using the same base model as DeepSeek (i.e., Qwen2.5-32B) and relies solely on RL, without prior Supervised Fine-Tuning (SFT). Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, offering valuable insights into scaling LLM reasoning capabilities across diverse tasks.
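The GRPO foundation the abstract builds on, and why ineffective samples arise, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the exact resampling criterion (dropping rollout groups whose rewards all agree) are assumptions based on how GRPO computes advantages.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-relative advantage: each rollout's reward is
    normalized by the mean and std of its sampling group (no value critic)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts got the same reward (all correct or all wrong):
        # the group yields zero advantage, i.e. an "ineffective sample".
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def history_resample(groups):
    """Hypothetical sketch of History Resampling: keep only rollout groups
    whose rewards disagree, so every retained group carries gradient signal."""
    return [g for g in groups if len(set(g["rewards"])) > 1]
```

Under this reading, a prompt whose rollouts are uniformly solved (or uniformly failed) contributes nothing to the GRPO update, which is what a resampling step over training history can filter out.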
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning with cross-domain RL
Overcoming replication challenges without SFT
Balancing math and coding via two-stage training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage cross-domain training paradigm
History Resampling technique
Pure RL without SFT
Authors: Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, Bing Yu
Kuaishou Technology