RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses reinforcement learning for long-horizon research tasks, where rewards are not verifiable, ground-truth answers are unavailable, trajectories span many tool-augmented decisions, and prior attempts are difficult to reuse. The authors propose RubricEM, a framework that uses self-generated rubrics as a shared interface for policy execution, judge feedback, and agent memory. RubricEM combines stagewise policy decomposition, Stage-Structured GRPO for credit assignment, and a reflection-based meta-policy that distills judged trajectories into reusable guidance, covering planning, evidence gathering, review, and synthesis. According to the abstract, RubricEM-8B outperforms comparable open models across four long-form research benchmarks and approaches the performance of proprietary deep-research systems.

📝 Abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
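The abstract does not give the formula behind Stage-Structured GRPO, so the following is only an illustrative sketch of the general idea it describes: replacing a single trajectory-level reward with one rubric judgment per stage, and normalizing each stage's score against the other trajectories sampled for the same prompt (the group-relative normalization used in standard GRPO). The stage names, the `stage_advantages` function, the score range, and the `eps` constant are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

# Hypothetical stage names, following the four stages listed in the abstract.
STAGES = ("plan", "gather", "review", "synthesize")


def stage_advantages(group_scores, eps=1e-6):
    """Compute group-relative advantages separately for each stage.

    group_scores: list of dicts, one per sampled trajectory for the same
    prompt, mapping stage name -> rubric score (assumed to be a float).
    Returns a list of dicts with the same keys holding normalized advantages.
    """
    advantages = [{} for _ in group_scores]
    for stage in STAGES:
        scores = [traj[stage] for traj in group_scores]
        mu = mean(scores)
        sigma = pstdev(scores)  # population std over the sampled group
        for adv, score in zip(advantages, scores):
            # Standard GRPO-style normalization, applied per stage
            # instead of once per trajectory (an assumption of this sketch).
            adv[stage] = (score - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    group = [
        {"plan": 0.0, "gather": 0.5, "review": 0.5, "synthesize": 1.0},
        {"plan": 1.0, "gather": 0.5, "review": 0.5, "synthesize": 0.0},
    ]
    for adv in stage_advantages(group):
        print(adv)
```

In this sketch, a trajectory with a weak plan but a strong synthesis receives a negative advantage on its planning stage and a positive one on its synthesis stage, which is one way per-stage judgments could give denser feedback than a single final score. Stages where every trajectory in the group scores the same contribute zero advantage.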

Problem

Research questions and friction points this paper is trying to address.

verifiable rewards

long-form research

policy decomposition

reinforcement learning

rubric-guided feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-guided RL

Policy Decomposition

Stage-Structured GRPO