PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a trade-off between robustness to persona prompts and persona expressivity in large language models: training that makes a model less sensitive to the assigned persona can also weaken its ability to portray a character faithfully. To mitigate this tension without additional inference-time overhead, the authors propose PerMix-RLVR, a persona-mixed training strategy built on reinforcement learning with verifiable rewards (RLVR). Reported results show that PerMix-RLVR improves the persona stability score (PSS) over RLVR by 21.2% on MATH500 and persona fidelity by 11.4% on PersonaGym, all while preserving downstream task performance.

📝 Abstract
Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when it is needed, e.g., for in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves the persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
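
The abstract does not say how personas are mixed into training prompts or how the verifiable reward is computed. As a rough illustration only, the sketch below shows one way the two ingredients could look for a math task: a persona is prepended to a question with some probability, and the reward is a simple exact-match check on the final line of the completion. The persona pool, `mix_persona`, `verifiable_reward`, and `persona_prob` are all hypothetical names and choices, not the authors' implementation, and the sketch does not cover the persona-fidelity side of the method.

```python
import random

# Hypothetical persona pool (assumption): the paper's actual personas are not listed here.
PERSONAS = [
    "You are a careful mathematician.",
    "You are a pirate who speaks in nautical slang.",
    "You are a bored teenager.",
]


def mix_persona(question, rng, persona_prob=0.5):
    """Prepend a randomly chosen persona to the question with probability persona_prob.

    Assumption: mixing is done per prompt by simple random sampling.
    """
    if rng.random() < persona_prob:
        return f"{rng.choice(PERSONAS)}\n\n{question}"
    return question


def verifiable_reward(completion, reference_answer):
    """Return 1.0 if the last non-empty line of the completion equals the reference, else 0.0.

    Assumption: exact string match stands in for a real answer verifier.
    """
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    final = lines[-1] if lines else ""
    return 1.0 if final == reference_answer.strip() else 0.0
```

In this sketch the reward depends only on the final answer, so a completion earns the same score whether or not it stays in character. That is consistent with the abstract's observation that outcome-based optimization can reduce persona sensitivity while leaving persona expressivity unrewarded.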

Problem

Research questions and friction points this paper is trying to address.

persona prompting
reinforcement learning
verifiable rewards
persona expressivity
robustness-fidelity trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

PerMix-RLVR
persona expressivity
reinforcement learning with verifiable rewards
robustness-fidelity trade-off
persona prompting