🤖 AI Summary
This work addresses the underexplored challenge of multi-user dialogue state tracking (DST). We present the first systematic extension of single-user DST benchmarks to multi-user settings. To construct multi-speaker evaluation data cost-effectively and controllably—without manual annotation—we propose an automated utterance injection method grounded in speech act theory. Through zero-shot LLM inference and multi-role dialogue structure modeling, we observe a substantial performance degradation (average F1 drop of 23.6%) across mainstream large language models on multi-user DST, exposing their critical limitations in modeling speaker-interaction dynamics. Our contributions include: (1) the first open-source multi-user DST benchmark; (2) a reproducible, theory-informed data generation framework; and (3) a principled failure analysis identifying key bottlenecks in role-aware dialogue understanding. This work establishes foundational resources and insights for advancing robust, multi-role dialogue state tracking.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.