EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of aligning large language models on open-ended generation tasks, where definitive reward signals are typically absent. The authors propose EvoRubric, a single-policy co-evolutionary reinforcement learning framework that unifies response generation and rubric generation by alternating the model's role between a Reasoner and a Rubric Generator. To keep rewards reliable and guard against reward hacking, the framework applies a multi-level verification pipeline (a meta-verifier, zero-variance pruning, and leave-one-out peer consensus) and archives validated criteria in a dynamic memory pool, removing the reliance on static human-annotated rubrics and on external proprietary models. Rubric criteria and generation strategy thus co-evolve autonomously, and the framework remains compatible with expert priors, from which it can uncover novel discriminative dimensions. Experiments across medical, writing, and scientific domains show that EvoRubric consistently outperforms static and external-LLM-driven alignment methods, with further gains when it is initialized with expert-annotated rubrics.

📝 Abstract

Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
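
The abstract names zero-variance pruning and Leave-One-Out peer consensus but does not give their exact formulation. The sketch below is one plausible reading, not the authors' implementation: it assumes each generated criterion has already scored a group of sampled responses, drops criteria whose scores are identical across the group (they cannot discriminate between responses), and then keeps a criterion only if its scores correlate positively with the mean of the remaining criteria. The function names, the example criteria, the Pearson-correlation test, and the threshold are all hypothetical. The meta-verifier and the memory pool are not modeled.

```python
from statistics import mean, pvariance


def prune_zero_variance(scores, eps=1e-12):
    """Drop criteria that give every sampled response the same score.

    Assumption: `scores` maps a criterion name to one score per response.
    """
    return {name: vals for name, vals in scores.items() if pvariance(vals) > eps}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def leave_one_out_consensus(scores, threshold=0.0):
    """Keep a criterion only if it agrees with the mean of all *other* criteria.

    Assumption: agreement is measured by Pearson correlation above `threshold`.
    """
    kept = {}
    names = list(scores)
    for name in names:
        peers = [scores[p] for p in names if p != name]
        if not peers:
            kept[name] = scores[name]  # no peers to compare against
            continue
        # Per-response mean score over the peer criteria (the held-out one excluded).
        peer_mean = [mean(col) for col in zip(*peers)]
        if pearson(scores[name], peer_mean) > threshold:
            kept[name] = scores[name]
    return kept


# Hypothetical scores for four sampled responses under five generated criteria.
scores = {
    "cites_evidence": [1.0, 0.0, 1.0, 0.0],
    "clear_structure": [1.0, 0.0, 0.5, 0.0],
    "addresses_question": [1.0, 0.0, 1.0, 0.5],
    "contrarian": [0.0, 1.0, 0.0, 1.0],  # disagrees with its peers
    "is_text": [1.0, 1.0, 1.0, 1.0],     # identical for every response
}

validated = leave_one_out_consensus(prune_zero_variance(scores))
print(sorted(validated))
```

In this toy run, `is_text` is removed by the variance check and `contrarian` by the peer check, leaving the three criteria that rank the responses consistently. With very few criteria a single outlier can dominate the peer mean, so a practical version would likely need a minimum pool size or a more robust aggregate.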

Problem

Research questions and friction points this paper is trying to address.

open-ended generation

reinforcement learning

rubric-based alignment

reward signal

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving rubrics

co-evolutionary reinforcement learning

open-ended generation