SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

📅 2026-03-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that existing task-oriented dialogue systems struggle to model authentic user behavior because large-scale, diverse spoken dialogue data is scarce. To this end, the authors present SpokenTOD, a multi-domain spoken task-oriented dialogue dataset comprising 52,390 dialogues and 1,034 hours of speech, described as the first systematically constructed resource of its kind. They further introduce SpokenUS, a spoken user simulator that generates characteristic spoken behaviors, including barge-in and incremental slot-value disclosure, and thereby better approximates human interaction patterns. Experimental results show that SpokenUS achieves goal coverage comparable to that of substantially larger models while substantially outperforming baseline simulators in human-rated MOS. Moreover, the spoken behaviors it generates pose a meaningful challenge for downstream dialogue agents.

📝 Abstract
Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.
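The abstract contrasts gradual slot disclosure with front-loading and reports goal coverage, but it does not define either measure. The sketch below is one plausible reading, not the authors' metric: `goal_coverage` and `first_turn_share` are hypothetical helper names, the goal is assumed to be a flat slot-to-value mapping, and a slot counts as disclosed when its value appears as a substring of a user turn.

```python
from typing import Dict, List


def goal_coverage(goal: Dict[str, str], user_turns: List[str]) -> float:
    """Fraction of goal slot values mentioned in at least one user turn.

    Assumption: substring matching on lowercased text stands in for
    whatever slot detection the paper actually uses.
    """
    if not goal:
        return 0.0
    text = " ".join(turn.lower() for turn in user_turns)
    hits = sum(1 for value in goal.values() if value.lower() in text)
    return hits / len(goal)


def first_turn_share(goal: Dict[str, str], user_turns: List[str]) -> float:
    """Fraction of goal slot values already disclosed in the first user turn.

    A value near 1.0 indicates front-loading; a lower value with full
    goal coverage indicates gradual disclosure.
    """
    if not goal or not user_turns:
        return 0.0
    first = user_turns[0].lower()
    hits = sum(1 for value in goal.values() if value.lower() in first)
    return hits / len(goal)


# Illustrative data only, not taken from SpokenTOD.
goal = {"area": "north", "food": "italian", "day": "friday"}
gradual = [
    "I need an Italian restaurant.",
    "Somewhere in the north.",
    "Book it for Friday.",
]
front_loaded = ["Book an Italian restaurant in the north for Friday."]

print(goal_coverage(goal, gradual), first_turn_share(goal, gradual))
print(goal_coverage(goal, front_loaded), first_turn_share(goal, front_loaded))
```

Under these assumptions both dialogues reach full goal coverage, but only the front-loaded one discloses every slot in its opening turn, which is the distinction the abstract draws.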
Problem

Research questions and friction points this paper is trying to address.

spoken dialogue systems
task-oriented dialogue
user simulation
spoken user behaviors
dialogue dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

spoken user simulator
task-oriented dialogue
barge-in modeling
dialogue data augmentation
human-like slot disclosure
Jonggeun Lee
Graduate School of Data Science, Seoul National University
Junseong Pyo
Department of Information Systems, Hanyang University
Jeongmin Park
Department of Computer Science and Engineering, Seoul National University
Yohan Jo
Seoul National University
Natural Language Processing, Agents, Computational Psychology, Reasoning