AI Summary
Current LLM-based user simulators tend to be overly cooperative and homogeneous. They fail to capture the diversity and difficulty of real users, so agents that look strong in simulation underperform in practice. This work proposes Persona Policies (PPol), a framework that represents persona generation as an evolvable Python program. By combining a multi-objective evolutionary search with LLM-guided program synthesis, PPol automatically generates task-consistent, realistic, and behaviorally diverse user simulators without hand-crafted personas. On the retail and airline domains of tau²-bench, evolved PPol programs gain 33-62% (absolute) in fitness score over the baseline simulator, and in a blinded evaluation annotators rated PPol-conditioned users as human 80.4% of the time. Agents trained with PPol achieve a 17% relative improvement in task success, indicating that more authentic simulated users make agents more robust.
Abstract
Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across tau^2-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.
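The search loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the trait list, the `Generator` class, both scoring functions, the weighting `alpha`, and the random `mutate` step (standing in for the LLM-guided program rewrite) are all hypothetical placeholders, chosen only to show how a two-part fitness can steer an evolutionary search away from an all-cooperative baseline.

```python
import random
from dataclasses import dataclass

# Hypothetical trait vocabulary; the abstract mentions unclear, impatient,
# and reluctant users, and describes baseline simulators as cooperative.
TRAITS = ("unclear", "impatient", "reluctant", "cooperative")


@dataclass(frozen=True)
class Generator:
    """Toy stand-in for an evolved persona program: sampling weights over TRAITS."""
    weights: tuple

    def sample_personas(self, n, rng):
        return rng.choices(TRAITS, weights=self.weights, k=n)


def human_likeness(personas):
    """Placeholder objective: share of personas that are not purely cooperative."""
    return sum(p != "cooperative" for p in personas) / len(personas)


def coverage(personas):
    """Placeholder objective: share of known traits that appear at least once."""
    return len(set(personas)) / len(TRAITS)


def fitness(gen, n=200, alpha=0.5, eval_seed=0):
    """Scalarized two-objective score; a fixed seed keeps it deterministic."""
    personas = gen.sample_personas(n, random.Random(eval_seed))
    return alpha * human_likeness(personas) + (1 - alpha) * coverage(personas)


def mutate(gen, rng):
    """Stub for the LLM-guided rewrite: randomly perturb one sampling weight."""
    w = list(gen.weights)
    i = rng.randrange(len(w))
    w[i] = max(0.01, w[i] + rng.uniform(-0.3, 0.3))
    return Generator(tuple(w))


def evolve(generations=30, pop_size=8, seed=0):
    rng = random.Random(seed)
    # Baseline generator: almost always cooperative.
    baseline = Generator((0.02, 0.02, 0.02, 0.94))
    population = [baseline] + [mutate(baseline, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Elitist selection: keep the top half, refill with mutated survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return baseline, max(population, key=fitness)


baseline, best = evolve()
print(f"baseline fitness={fitness(baseline):.2f}, evolved fitness={fitness(best):.2f}")
```

Because selection is elitist and the fitness is deterministic, the best score in the population never falls below the baseline's. In the setting the abstract describes, the candidates would be Python programs edited by an LLM, and the objectives would be measured against human behavior rather than computed from these toy proxies.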