Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) approaches for enhancing large language model (LLM) reasoning focus predominantly on mathematics and coding, largely because reliable cross-domain reward signals are scarce and poorly understood. Method: We introduce Guru, the first verifiable cross-domain RL reasoning corpus (92K samples spanning mathematics, code, science, logic, simulation, and tabular reasoning), and propose a domain-aware RL training paradigm integrating domain-specific rewards, intra- and inter-domain fine-tuning, and PPO optimization. Contribution/Results: We demonstrate that RL drives genuine skill acquisition, not mere knowledge activation, in pretraining-underexposed domains (e.g., logic, simulation, tabular reasoning). We release the Guru-7B/32B models, which achieve SOTA on 17 cross-domain reasoning benchmarks (+7.9%/+6.7% over the best baselines) and substantially improve Pass@k on complex tasks. All data, models, and code are open-sourced.

📝 Abstract
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
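The Pass@k metric cited in the abstract is commonly computed with the unbiased combinatorial estimator: given n sampled completions of which c are correct, it estimates the probability that at least one of k draws passes. A minimal sketch (illustrative only, not code from the paper's release):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    n: total completions sampled per problem
    c: number of those completions that pass the verifier
    k: budget being evaluated
    """
    # If fewer than k completions are incorrect, every size-k draw
    # must contain at least one correct completion.
    if n - c < k:
        return 1.0
    # P(all k draws incorrect) = C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 samples, 1 correct -> Pass@1 = 0.5
print(pass_at_k(2, 1, 1))
```

Averaging this quantity over a benchmark's problems gives the reported Pass@k score; the improvements the paper reports on complex tasks are measured in exactly this per-problem, then averaged, fashion.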
Problem

Research questions and friction points this paper is trying to address.

Lack of reliable RL rewards across diverse reasoning domains
Limited understanding of RL's broader applicability beyond math and code
Need for domain-specific training to achieve meaningful performance gains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-domain RL training for LLM reasoning
Domain-specific reward design in Guru corpus
State-of-the-art Guru-7B and Guru-32B models
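To make "domain-specific reward design" concrete: verifiable-reward RL typically scores each rollout with a rule-based checker rather than a learned reward model, which is what makes the 92K Guru examples "verifiable". A minimal sketch of such a checker for a math-style task (a hypothetical illustration, not the paper's actual reward code):

```python
def math_reward(response: str, gold_answer: str) -> float:
    """Rule-based verifiable reward: 1.0 if the model's final answer
    matches the reference after light normalization, else 0.0."""
    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "").lower()

    # Treat the last non-empty line of the rollout as the final answer.
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 if normalize(lines[-1]) == normalize(gold_answer) else 0.0

print(math_reward("Step 1: 6 * 7 = 42.\n42", "42"))  # matched answer
```

Each of the six domains would need its own verifier of this kind (unit tests for code, symbolic checkers for logic, etc.), which is why the corpus construction emphasizes per-domain reward design and filtering.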
👥 Authors
Zhoujun Cheng (UC San Diego): Natural Language Processing, Artificial Intelligence
Shibo Hao (Ph.D. student, UC San Diego): machine learning, large language models
Tianyang Liu (Ph.D. in Computer Science, UC San Diego): artificial intelligence, large language models, code intelligence
Fan Zhou (MBZUAI)
Yutao Xie (UC San Diego)
Feng Yao (University of California, San Diego): Natural Language Processing
Yuexin Bian (UC San Diego)
Yonghao Zhuang (Carnegie Mellon University): Distributed Systems, Machine Learning
Nilabjo Dey (Purdue University)
Yuheng Zha (University of California, San Diego): Natural Language Processing, Vision Language
Yi Gu (UC San Diego)
Kun Zhou (UC San Diego)
Yuqi Wang (MBZUAI)
Yuan Li (Carnegie Mellon University)
Richard Fan (MBZUAI)
Jianshu She (MBZUAI)
Chengqian Gao (MBZUAI): Reinforcement Learning
Abulhair Saparov (Assistant Professor, Purdue University): Natural Language Understanding, Reasoning, Natural Language Processing, Statistical Machine Learning
Haonan Li (MBZUAI)
Taylor W. Killian (Senior Research Scientist, MBZUAI Institute of Foundation Models): Machine Learning, Reinforcement Learning, Healthcare, Transfer Learning, Causal Inference
Mikhail Yurochkin (Staff AI Scientist, IFM MBZUAI, ex MIT-IBM Watson AI Lab): Machine Learning, Foundation Models, Evaluation, Model Fusion
Zhengzhong Liu (Institute of Foundation Models): Natural Language Processing, Machine Learning
Eric P. Xing (Carnegie Mellon University)
Zhiting Hu (Assistant Professor at UC San Diego): Machine Learning, Artificial Intelligence, Natural Language Processing