DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing role-playing evaluations are largely confined to the turn level, making it difficult to assess and improve agents’ role consistency and dialogue quality over extended multi-turn conversations. To address this limitation, this work proposes DynSess, a unified session-level framework for role-playing evaluation and optimization. It introduces a rubric-based session-level evaluator, DynSess-Eval, which scores complete dialogues and produces the reward signals used to construct high-quality training trajectories. Using these rewards, the framework builds trajectories through multi-turn lookahead search and trains a role-playing agent, DynSess-Character, with two complementary policy optimization variants: off-policy DSPO and on-policy GSRPO. Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong long-horizon role consistency and interactive ability.

📝 Abstract

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.
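
To make the idea of multi-turn lookahead search with a session-level reward concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not give the search procedure, the rubrics, or how DSPO and GSRPO use the resulting trajectories. The function `lookahead_select`, the toy stand-ins `toy_agent`, `toy_user`, and `toy_score`, and the `depth` parameter are all hypothetical names chosen for illustration. The sketch only shows the general pattern: each candidate reply is rolled forward for a few exchanges, and the whole resulting session is scored, not the single reply.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]   # (speaker, text)
Session = List[Turn]


def lookahead_select(
    history: Session,
    candidates: List[str],
    agent_reply: Callable[[Session], str],
    user_reply: Callable[[Session], str],
    score_session: Callable[[Session], float],
    depth: int = 1,
) -> Tuple[str, str, List[Tuple[str, float]]]:
    """Rank candidate replies by the score of the session they lead to.

    Each candidate is appended to the history, the dialogue is rolled
    forward for `depth` user/agent exchanges, and the complete session
    is scored. Returns (best, worst, all scored candidates).
    """
    scored = []
    for cand in candidates:
        session = history + [("agent", cand)]
        for _ in range(depth):
            session = session + [("user", user_reply(session))]
            session = session + [("agent", agent_reply(session))]
        scored.append((cand, score_session(session)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[0][0], scored[-1][0], scored


# --- Toy stand-ins (assumptions, not DynSess components) ---------------

def toy_agent(session: Session) -> str:
    # Hypothetical policy: keeps whatever style its last turn had.
    last = next(text for who, text in reversed(session) if who == "agent")
    return "Arr, aye." if "arr" in last.lower() else "Sure thing."


def toy_user(session: Session) -> str:
    # Hypothetical user simulator with a fixed follow-up.
    return "Tell me more."


def toy_score(session: Session) -> float:
    # Hypothetical session-level score: share of agent turns that stay
    # in a pirate persona. A real rubric would be far richer.
    agent_turns = [text for who, text in session if who == "agent"]
    in_character = sum("arr" in text.lower() for text in agent_turns)
    return in_character / len(agent_turns)


history: Session = [("user", "How is the weather?")]
candidates = ["As an AI, I cannot say.", "Arr, the sea be calm today."]
chosen, rejected, scored = lookahead_select(
    history, candidates, toy_agent, toy_user, toy_score, depth=2
)
```

In this toy run the in-character candidate ranks first because the rollout it triggers stays in character, which is the long-horizon effect a turn-level score would miss. One possible use of the output is to keep the best and worst candidates as a preference pair for training, but that is an assumption and not something the abstract states.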

Problem

Research questions and friction points this paper is trying to address.

role-playing agents

session-level evaluation

long-horizon quality

character consistency

multi-turn dialogue

Innovation

Methods, ideas, or system contributions that make the work stand out.

session-level evaluation

role-playing agents

long-horizon dialogue