Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

📅 2025-11-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning (RL) methods for LLM reasoning are largely confined to single domains (e.g., mathematics) and rely on purely online training with verifiable rewards, which restricts the exploration space and hinders cross-domain reasoning. This work proposes RGR-GRPO (Reward and Guidance through Rubrics), an RL framework that uses scoring rubrics to provide fine-grained, dense, cross-domain reward signals together with offline guidance. Built on the GRPO algorithm, RGR-GRPO lets large language models (LLMs) receive informative rewards while exploring a larger solution space, improving exploration stability and generalization on complex, multi-domain reasoning tasks. Evaluated across 14 cross-domain benchmarks, it achieves average accuracy gains over the verifiable online RL baseline of +7.0% (mathematics), +5.4% (physics), +8.4% (chemistry), and +6.6% (general reasoning), alongside consistent improvements in pass@k performance.

📝 Abstract
Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose $\textbf{RGR-GRPO}$ (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with the verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and effective breakthrough beyond existing performance bottlenecks.
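
To make the reward side concrete, here is a minimal sketch of how a scoring rubric can turn a single 0/1 verifiable outcome into a dense scalar reward that feeds GRPO's group-relative advantage normalization. The criterion names, weights, and judge scores are illustrative assumptions rather than details from the paper; the within-group normalization itself is the standard GRPO step.

```python
import numpy as np

# Hypothetical rubric: per-criterion weights summing to 1. The criterion
# names and weights are illustrative assumptions, not from the paper.
RUBRIC_WEIGHTS = {
    "correct_final_answer": 0.5,
    "valid_intermediate_steps": 0.3,
    "applies_domain_principles": 0.2,
}

def rubric_reward(criterion_scores):
    """Aggregate per-criterion judge scores in [0, 1] into one dense reward."""
    return sum(w * criterion_scores.get(name, 0.0)
               for name, w in RUBRIC_WEIGHTS.items())

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO step: normalize rewards within the group of rollouts
    sampled for the same prompt (mean-zero, unit-variance)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one prompt, each scored against the rubric by a judge.
group_scores = [
    {"correct_final_answer": 1.0, "valid_intermediate_steps": 0.8, "applies_domain_principles": 1.0},
    {"correct_final_answer": 0.0, "valid_intermediate_steps": 0.6, "applies_domain_principles": 0.5},
    {"correct_final_answer": 1.0, "valid_intermediate_steps": 0.4, "applies_domain_principles": 0.0},
    {"correct_final_answer": 0.0, "valid_intermediate_steps": 0.1, "applies_domain_principles": 0.2},
]
rewards = [rubric_reward(s) for s in group_scores]
print(grpo_advantages(rewards))
# Unlike a binary verifiable reward (which would leave rollouts 2 and 4
# indistinguishable), the rubric separates them, giving a denser signal.
```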
Problem

Research questions and friction points this paper is trying to address.

Limited exploration space in single-domain reinforcement learning for reasoning
Reliance on purely online RL frameworks restricting reasoning performance
Need for fine-grained reward signals and guidance in multi-domain reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-driven RL framework for multi-domain reasoning
Provides fine-grained rewards and offline guidance
Enables exploration of a larger solution space during training (one possible mechanism is sketched below)
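
The page does not spell out the guidance mechanism, but one plausible reading of "offline guidance" is that rubric-aligned reference solutions are mixed into each GRPO rollout group as off-policy samples, widening the group beyond what the current policy would sample on its own. The sketch below is a hypothetical illustration under that assumption; `guided_solutions` and `k_guided` are invented names, not the paper's API.

```python
import random

def build_grpo_group(policy_rollouts, guided_solutions, k_guided=2, seed=None):
    """Hypothetical group construction: swap a few on-policy rollouts for
    rubric-aligned offline solutions, so the group covers solution modes
    the current policy would rarely reach on its own.

    Off-policy group members would need importance-ratio corrections in
    the GRPO objective; that bookkeeping is omitted here for brevity."""
    rng = random.Random(seed)
    keep = max(len(policy_rollouts) - k_guided, 0)
    return policy_rollouts[:keep] + rng.sample(guided_solutions, k_guided)

# Example: 6 on-policy rollouts, 2 replaced by offline guided solutions.
group = build_grpo_group([f"rollout_{i}" for i in range(6)],
                         ["guided_a", "guided_b", "guided_c"],
                         k_guided=2, seed=0)
print(group)
```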