Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

šŸ“… 2025-10-31
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Large language models (LLMs) frequently exhibit insufficient role consistency when simulating human users—manifesting as persona deviation, contradictory statements, and behavioral incoherence—thereby limiting their deployment in high-stakes interactive domains such as healthcare and education. To address this, the authors propose the first reinforcement learning (RL) framework explicitly designed for role consistency in multi-turn dialogue. The method introduces and integrates three computationally tractable consistency metrics—prompt consistency, inter-sentence consistency, and question-answer consistency—as fine-grained reward signals within an RLHF (Reinforcement Learning from Human Feedback) fine-tuning pipeline. Crucially, the framework operates without human annotation and enables end-to-end optimization. Empirical evaluation across three simulated user roles—patient, student, and social partner—demonstrates a reduction of more than 55% in inconsistency instances, yielding substantial improvements in dialogue coherence, behavioral stability, and persona fidelity.
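The core mechanism described above is turning the three consistency metrics into a single scalar reward for the RL step. A minimal sketch of that combination is shown below; the names `ConsistencyScores` and `persona_reward`, the [0, 1] score range, and the equal default weights are illustrative assumptions, not details from the paper, and how each component score is computed (e.g. by an NLI model or an LLM judge) is left unspecified.

```python
from dataclasses import dataclass


@dataclass
class ConsistencyScores:
    """Per-turn consistency scores in [0, 1]; higher means more consistent.

    The three fields mirror the metrics named in the summary; the names
    and scale here are assumptions for illustration.
    """
    prompt_to_line: float  # does the new utterance match the persona prompt?
    line_to_line: float    # does it agree with earlier dialogue turns?
    qa: float              # does it answer probe questions consistently?


def persona_reward(scores: ConsistencyScores,
                   weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three metrics into one scalar reward for RL fine-tuning.

    The equal default weights are a placeholder, not the paper's
    reported configuration.
    """
    w1, w2, w3 = weights
    total = w1 + w2 + w3
    return (w1 * scores.prompt_to_line
            + w2 * scores.line_to_line
            + w3 * scores.qa) / total
```

In an RLHF-style pipeline, this scalar would be attached to each generated turn (or to the full episode) and maximized by the policy-gradient update, which is what lets the framework optimize end-to-end without human annotation.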


šŸ“ Abstract
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and we validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.
Problem

Research questions and friction points this paper is trying to address.

Simulating consistent human personas in multi-turn dialogues
Reducing persona drift and contradictions in LLM behavior
Improving persona consistency through reinforcement learning metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn reinforcement learning fine-tunes LLMs
Three automatic metrics evaluate persona consistency
Framework reduces persona drift by over 55%