CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of ambiguous reward signals and unstable subjective evaluations in open-ended, subjective dialogue tasks (e.g., role-playing), which limit the effectiveness of Reinforcement Learning Fine-Tuning (RLFT), this paper proposes a novel trajectory-level comparative reward modeling paradigm. Instead of relying on absolute per-sample scoring, the authors introduce CharacterArena, a two-stage evaluation framework that elicits objective pairwise preferences through multi-round contextualized role-play simulations and crowd-sourced comparative annotations. They further design a Comparative Policy Optimization (CPO) algorithm tailored to this paradigm. Extensive experiments on three benchmarks (CharacterEval, CharacterBench, and CharacterArena) demonstrate significant improvements in dialogue quality while effectively mitigating context bias and reward ambiguity. The results validate the efficacy and generalizability of preference-based learning for open-ended subjective tasks.

📝 Abstract
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.
Problem

Research questions and friction points this paper is trying to address.

Addresses reward ambiguity in role-playing dialogue systems
Shifts from sample-wise to comparative group-wise scoring
Mitigates subjective evaluation bias via objective comparisons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative group-wise scoring replaces sample-wise scoring
CharacterArena framework enables trajectory-level comparative evaluation
CPO minimizes reward ambiguity via objective trajectory comparisons
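The shift from sample-wise to comparative group-wise scoring can be illustrated with a minimal sketch. The paper does not include code here, so `judge_pairwise`, `comparative_group_scores`, and `group_advantages` are hypothetical names, and the length-based judge is a toy stand-in for the paper's trajectory-level comparative judge (an LLM or human annotator comparing full role-play trajectories):

```python
from itertools import combinations
from statistics import mean, pstdev

def judge_pairwise(resp_a: str, resp_b: str) -> int:
    """Toy stand-in for a trajectory-level comparative judge.
    Returns 1 if resp_a is preferred, -1 if resp_b is preferred, 0 for a tie."""
    if len(resp_a) == len(resp_b):
        return 0
    return 1 if len(resp_a) > len(resp_b) else -1

def comparative_group_scores(responses: list[str]) -> list[float]:
    """Score each response by its win rate against the rest of its group,
    replacing independent per-sample scoring with comparative scoring."""
    wins = [0.0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        outcome = judge_pairwise(responses[i], responses[j])
        if outcome == 1:
            wins[i] += 1
        elif outcome == -1:
            wins[j] += 1
        else:  # tie: split the point
            wins[i] += 0.5
            wins[j] += 0.5
    n_opponents = len(responses) - 1
    return [w / n_opponents for w in wins]

def group_advantages(scores: list[float]) -> list[float]:
    """Normalize group win rates into zero-mean advantages that a
    group-based policy-gradient update could consume."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]
```

Because every score is a win rate relative to the same group, the reward signal is anchored to explicit comparisons rather than an absolute scale, which is the property the paper argues stabilizes training on subjective tasks.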
Authors

Xinge Ye
Tongyi Lab, Alibaba Group
Rui Wang
Tongyi Lab, Alibaba Group
Yuchuan Wu
Tongyi Lab, Alibaba Group
Victor Ma
Independent Researcher
Feiteng Fang
University of Science and Technology of China
Fei Huang
Tongyi Lab, Alibaba Group
Yongbin Li
Tongyi Lab, Alibaba Group