RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: High-quality Chinese spoken task-oriented dialogue (TOD) datasets are severely lacking, particularly ones that capture real-world speech disfluencies, speaker diversity, and multi-turn, cross-domain, speech-text multimodal annotations. Method: We introduce the first high-fidelity Chinese multimodal TOD dataset, comprising 5.4k dialogues (150 hours, 60k utterances) recorded from real interactions and meticulously aligned with textual transcriptions that preserve natural pauses, repetitions, and accent variations. We further propose a novel cross-modal chat protocol that enables dynamic switching between speech and text modalities during inference. Contribution/Results: The dataset enables the first systematic evaluation of Chinese spoken large language models on ASR with disfluent speech, speaker adaptation, and cross-domain generalization. Empirical results validate its effectiveness for robustness assessment, establishing a new benchmark resource and evaluation paradigm for Chinese spoken language modeling research.

📝 Abstract
In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.
Problem

Research questions and friction points this paper is trying to address.

Lack of Chinese speech-text dialogue datasets with real speech signals
Absence of speech disfluencies and speaker variations in existing datasets
Need for cross-modal interaction analysis in task-oriented dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese speech-text dual-modal dataset
Cross-modal chat task with dynamic switching
Evaluation of robustness and speaker sensitivity
Enzhi Wang
Nankai University
Machine learning, data mining, natural language processing
Qicheng Li
TMCC, College of Computer Science, Nankai University
Shiwan Zhao
Independent Researcher; formerly Research Scientist at IBM Research - China (2000-2020)
AGI, Large Language Model, NLP, Speech, Recommender System
Aobo Kong
Nankai University
NLP, LLM
Jiaming Zhou
TMCC, College of Computer Science, Nankai University
Xi Yang
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Yequan Wang
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Yonghua Lin
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Yong Qin
TMCC, College of Computer Science, Nankai University