SERL: Self-Examining Reinforcement Learning on Open-Domain Tasks

πŸ“… 2025-11-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Open-domain tasks pose significant challenges for reinforcement learning from human feedback (RLHF) and reinforcement learning from verifiable rewards (RLVR): their subjective nature and lack of objective ground-truth answers hinder the acquisition of reliable, externally grounded reward signals. To address this, the authors propose Self-Examining Reinforcement Learning (SERL), a framework in which a large language model (LLM) serves as both generator (Actor) and evaluator (Judge), enabling internal, self-supervised optimization via a dual reward mechanism: (i) a Copeland-style pairwise comparison reward derived from judgments across a group of generated responses, and (ii) a self-consistency reward that encourages coherent judgments. By employing the same LLM as its own judge, SERL supports unsupervised, closed-loop self-improvement without external annotations. On AlpacaEval 2, SERL boosts Qwen3-8B's LC win rate from 52.37% to 59.90%, outperforming existing self-improvement approaches, matching the significantly larger Qwen3-32B, and achieving state-of-the-art results among comparable methods.

πŸ“ Abstract
Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks precludes the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework where the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms without any external signals. On the one hand, to improve the Actor's capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. On the other hand, a self-consistency reward that encourages coherent judgments is proposed to improve the Judge's reliability. This process refines the Judge's capability, which in turn provides a more robust reward for the Actor. Experiments show that our method outperforms existing self-improvement training methods. SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2 from 52.37% to 59.90%. To the best of our knowledge, our method achieves state-of-the-art performance among self-improving approaches. Furthermore, it achieves performance comparable to significantly larger models like Qwen3-32B, demonstrating superior effectiveness and robustness on open-domain tasks.
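The abstract's Copeland-style Actor reward can be sketched concretely: every pair of sampled responses is judged, and each response's reward is its wins minus losses, aggregated over all pairwise comparisons. The sketch below is a minimal illustration, not the paper's exact formulation; the `judge` callable is a hypothetical stand-in for the LLM Judge's pairwise verdict.

```python
from itertools import combinations

def copeland_rewards(responses, judge):
    """Copeland-style pairwise rewards for a group of responses.

    `judge(a, b)` is a hypothetical pairwise verdict: +1 if `a` beats `b`,
    -1 if `b` beats `a`, 0 for a tie (in SERL this role is played by the
    same LLM that generated the responses). Each response's Copeland score
    is its wins minus losses over all pairs, normalized to [-1, 1].
    """
    n = len(responses)
    scores = [0] * n
    for i, j in combinations(range(n), 2):
        verdict = judge(responses[i], responses[j])
        scores[i] += verdict
        scores[j] -= verdict
    denom = max(n - 1, 1)  # each response appears in n-1 comparisons
    return [s / denom for s in scores]
```

A response preferred over every other member of the group receives reward 1.0, one that loses every comparison receives -1.0, giving a dense group-relative signal without any external reward model.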
Problem

Research questions and friction points this paper is trying to address.

Addresses subjectivity challenges in open-domain reinforcement learning tasks
Eliminates dependency on external reward mechanisms for LLM training
Creates a self-improving framework where the LLM acts as both Actor and Judge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a single LLM as both Actor and Judge
Employs Copeland-style pairwise comparison for rewards
Introduces self-consistency reward to improve judgment reliability
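One way to picture the self-consistency reward for the Judge: query the Judge on the same pair in both orders and reward verdicts that agree under the swap. This is a hedged sketch under that position-consistency assumption, not necessarily the paper's exact definition; `judge` is again a hypothetical verdict function.

```python
def self_consistency_reward(pairs, judge):
    """Fraction of pairwise judgments that are order-consistent.

    Assumption (illustrative, not from the paper): a coherent Judge's
    verdict should flip sign when the two responses are swapped, i.e.
    judge(a, b) == -judge(b, a) for verdicts in {+1, -1, 0}.
    """
    if not pairs:
        return 0.0
    consistent = sum(1 for a, b in pairs if judge(a, b) == -judge(b, a))
    return consistent / len(pairs)
```

A Judge that always declares the first response the winner scores 0.0 here, while an order-invariant Judge scores 1.0, so this signal penalizes positional bias and incoherent preferences.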
πŸ”Ž Similar Papers
No similar papers found.
Weixuan Ou
Zhejiang University, Hangzhou, China
Yanzhao Zheng
Alibaba Group, Hangzhou, China
Shuoshuo Sun
Alibaba Group, Hangzhou, China
Wei Zhang
Alibaba Group, Hangzhou, China
Baohua Dong
Alibaba Group, Hangzhou, China
Hangcheng Zhu
Alibaba Group, Hangzhou, China
Ruohui Huang
Alibaba Group, Hangzhou, China
Gang Yu
Alibaba Group, Hangzhou, China
Pengwei Yan
Zhejiang University, Hangzhou, China
Yifan Qiao
Postdoc at University of California, Berkeley