🤖 AI Summary
Existing knowledge-guided information-seeking dialogue datasets are limited in scale, multilingual coverage, and spoken-language authenticity. This work proposes HEALTHDIAL, the first large-scale, multilingual, multi-parallel, and knowledge-grounded spoken dialogue dataset. It is built from authoritative World Health Organization (WHO) content and comprises 6,000 dialogues across Arabic, Chinese, English, and Spanish. The dataset includes 163 hours of user speech recorded from native speakers, with each speaker annotated for demographic and sociolinguistic variables. The release also includes an open-source toolkit for data collection and system evaluation, a prototype retrieval-augmented generation (RAG) system, and benchmark results on key dialogue tasks. Those results show consistent performance disparities in information-seeking dialogue, even among high-resource languages, which makes HEALTHDIAL a useful resource and a source of new challenges for cross-lingual dialogue system research.
📝 Abstract
Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.