ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reward models for role-playing language agents (RPLAs) struggle to capture subjective dialogue preferences, generalize poorly, and scale insufficiently. To address these challenges, this work introduces an act-adaptive margin mechanism and a self-evolving training paradigm that leverages unlabeled data. The authors construct RoleplayPref—the first large-scale, RPLA-specific preference dataset—and RoleplayEval, a dedicated evaluation benchmark. Their method combines character-level representations, act-adaptive margin optimization, self-supervised preference distillation, and the Direct Preference Optimization (DPO) framework. Experiments show a 13% improvement in preference ranking accuracy over the Bradley–Terry model and new state-of-the-art performance on both CharacterEval and RoleplayEval. The approach also significantly improves role consistency and interaction authenticity, confirming its effectiveness at modeling nuanced, subjective conversational preferences for RPLAs.

📝 Abstract
Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at https://github.com/calubkk/ChARM.
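To make the act-adaptive margin idea concrete, here is a minimal sketch of a Bradley–Terry pairwise loss with an additive margin. The `margin` argument stands in for the paper's act-adaptive term; how ChARM actually computes that margin per dialogue act is not specified here, so treat this as an illustrative assumption, not the paper's exact formulation.

```python
import math

def bt_margin_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Bradley-Terry pairwise loss with an additive margin.

    Standard BT loss is -log(sigmoid(r_chosen - r_rejected)).
    An act-adaptive scheme (sketched here via a caller-supplied
    `margin`) would enlarge the margin for harder or more subjective
    dialogue acts, demanding a wider reward gap before the pair is
    considered well-separated.
    """
    gap = r_chosen - r_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# With equal rewards and no margin, the loss is log(2) ~= 0.693;
# raising the margin increases the loss for the same reward gap,
# pushing the reward model to separate preferred replies further.
```

In a trained reward model the rewards would be model outputs and the loss would be minimized by gradient descent; the scalar version above only shows the shape of the objective.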
Problem

Research questions and friction points this paper is trying to address.

Improving scalability of reward models for Role-Playing Language Agents
Adapting to subjective conversational preferences in role-playing scenarios
Enhancing learning efficiency and generalizability in human-computer interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Act-adaptive margin enhances learning efficiency
Self-evolution mechanism uses unlabeled data
Large-scale dataset RoleplayPref for RPLAs
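The self-evolution mechanism above can be pictured as a confidence-filtered pseudo-labeling loop: the current reward model scores candidate reply pairs from unlabeled data, keeps only the pairs it ranks confidently, and those become new training preferences. The sketch below is an assumed workflow under a Bradley–Terry win probability, not the paper's exact algorithm; the `score` callable and `threshold` are illustrative placeholders.

```python
import math

def self_evolve_pairs(score, pairs, threshold=0.7):
    """Filter unlabeled reply pairs into confident pseudo-preferences.

    score: callable(reply) -> float reward from the current model.
    pairs: list of (reply_a, reply_b) tuples drawn from unlabeled data.
    Returns (chosen, rejected) tuples where the model's Bradley-Terry
    win probability exceeds `threshold`; ambiguous pairs are dropped.
    """
    pseudo_labels = []
    for a, b in pairs:
        gap = score(a) - score(b)
        p_a_wins = 1.0 / (1.0 + math.exp(-gap))  # BT win probability
        if p_a_wins >= threshold:
            pseudo_labels.append((a, b))          # confident: a preferred
        elif p_a_wins <= 1.0 - threshold:
            pseudo_labels.append((b, a))          # confident: b preferred
        # otherwise discard the ambiguous pair
    return pseudo_labels
```

In a full self-evolution loop, the reward model would be retrained on `pseudo_labels` and the process repeated, expanding training coverage without additional human annotation.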