🤖 AI Summary
Existing LLM-based user simulators struggle to maintain goal-directed behavior over multi-turn dialogues, resulting in poor goal alignment and low reliability. To address this, we propose the User Goal State Tracking (UGST) framework, which supports a three-stage goal alignment methodology: goal initialization, dynamic goal tracking, and response alignment. The framework is complemented by an evaluation suite that measures goal retention rate, path consistency, and task completion rate. UGST integrates multi-turn dialogue modeling with explicit goal reasoning, enabling simulators to autonomously track goal progress and generate goal-aligned responses. Evaluated on MultiWOZ 2.4 and τ-Bench, UGST improves goal consistency (+12.7% average retention rate) and task success rate. This work establishes an interpretable, quantifiable, and reliable paradigm for training and evaluating conversational agents via user simulation.
📝 Abstract
User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we show that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations, a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and τ-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.
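To make the idea of tracking goal progression across turns more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the `GoalState` class, the two-value `Status` enum, the example sub-goal strings, and the `progress` fraction are all hypothetical names chosen for illustration. In particular, the decision about which sub-goal a turn has addressed is supplied by the caller here, whereas a real simulator would presumably infer it from the dialogue.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    """Hypothetical two-value status; a real tracker may need finer states."""
    PENDING = "pending"
    COMPLETED = "completed"


@dataclass
class GoalState:
    """Illustrative per-conversation record of a user's sub-goals."""
    sub_goals: Dict[str, Status] = field(default_factory=dict)

    @classmethod
    def from_goal(cls, sub_goals: List[str]) -> "GoalState":
        # Every sub-goal starts as pending at the beginning of the dialogue.
        return cls({goal: Status.PENDING for goal in sub_goals})

    def mark_completed(self, sub_goal: str) -> None:
        # Assumption: the caller has already judged that the latest turn
        # satisfied this sub-goal (for example, via a separate model call).
        if sub_goal not in self.sub_goals:
            raise KeyError(f"unknown sub-goal: {sub_goal!r}")
        self.sub_goals[sub_goal] = Status.COMPLETED

    def pending(self) -> List[str]:
        # Sub-goals the simulated user still needs to pursue.
        return [g for g, s in self.sub_goals.items() if s is Status.PENDING]

    def progress(self) -> float:
        # Fraction of sub-goals completed so far; 1.0 for an empty goal.
        if not self.sub_goals:
            return 1.0
        done = sum(s is Status.COMPLETED for s in self.sub_goals.values())
        return done / len(self.sub_goals)


if __name__ == "__main__":
    state = GoalState.from_goal(
        ["book a hotel", "request free parking", "ask for the address"]
    )
    state.mark_completed("book a hotel")
    print(state.pending())
    print(state.progress())
```

In a setup like this, the list returned by `pending()` could be passed to the simulator before each turn so that its next utterance targets unfinished sub-goals. That wiring is an assumption on top of the abstract, which does not specify how the tracked state is consumed.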