SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

📅 2023-05-22
🏛️ Neural Information Processing Systems
📈 Citations: 36
Influential: 4
🤖 AI Summary
Existing task-oriented dialogue (TOD) models are trained predominantly on datasets written by annotators, so they struggle with challenges specific to real-world spoken conversation, such as ASR errors, word-by-word processing, and reasoning in spoken language. To address this gap, we introduce SpokenWOZ, a large-scale speech-text TOD dataset collected from human-to-human spoken conversations: 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio. Based on these spoken characteristics, we present two new challenges, cross-turn slot detection and reasoning slot detection, and evaluate text-modal models, newly proposed dual-modal models, and LLMs such as ChatGPT. The most advanced dialogue state tracking (DST) model reaches only 25.65% joint goal accuracy, and the state-of-the-art end-to-end model correctly completes the user request in only 52.1% of dialogues, which indicates substantial room for improvement in spoken TOD modeling.
📝 Abstract
Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they ignore the unique challenges in spoken conversation. To tackle these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g., ChatGPT. The results show that current models still have substantial room for improvement in spoken conversation: the most advanced dialogue state tracker achieves only 25.65% joint goal accuracy, and the SOTA end-to-end model correctly completes the user request in only 52.1% of dialogues. The dataset, code, and leaderboard are available at https://spokenwoz.github.io/.
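The abstract reports joint goal accuracy without defining it. As a minimal sketch, assuming the definition commonly used in dialogue state tracking (a turn counts as correct only if every predicted slot-value pair matches the gold state exactly), the metric could be computed as below. The function name, slot names, and example values are hypothetical and are not taken from the SpokenWOZ code.

```python
from typing import Dict, List

# A dialogue state maps slot names to values.
# Slot names here are illustrative assumptions, not SpokenWOZ's actual schema.
DialogueState = Dict[str, str]


def joint_goal_accuracy(predicted: List[DialogueState],
                        gold: List[DialogueState]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # One wrong, missing, or extra slot makes the whole turn incorrect.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical three-turn dialogue; the second turn has one wrong slot value.
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-day": "friday"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-day": "friday"},
]

jga = joint_goal_accuracy(predicted_states, gold_states)
print(f"Joint goal accuracy: {jga:.2%}")
```

Because a single slot error invalidates the whole turn, the metric is strict, which helps explain why scores can stay low even when most individual slots are predicted correctly.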
Problem

Research questions and friction points this paper is trying to address.

Bridges the gap between written and spoken task-oriented dialogue research
Addresses unique challenges in spoken conversation processing
Introduces new spoken language reasoning tasks for evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale speech-text dataset SpokenWOZ
Cross-turn slot detection challenge
Dual-modal models for spoken TOD
Authors

Shuzheng Si
Tsinghua University
Natural Language Processing, Large Language Models

Wen-Cheng Ma
DAMO Academy, Alibaba Group

Haoyu Gao
DAMO Academy, Alibaba Group

Yuchuan Wu
Alibaba Tongyi Lab (通义实验室)
Conversational AI, Large Language Models, Social Intelligence

Ting-En Lin
Alibaba Group, Tongyi
Natural Language Processing, Spoken Dialogue System, Large Language Model, Deep Learning

Yinpei Dai
Tsinghua, Alibaba, UMich
Embodied AI, Robotics, Dialogue System

Hangyu Li
DAMO Academy, Alibaba Group

Rui Yan
Gaoling School of Artificial Intelligence, Renmin University of China

Fei Huang
DAMO Academy, Alibaba Group

Yongbin Li
DAMO Academy, Alibaba Group