RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: High-quality Chinese spoken task-oriented dialogue (TOD) datasets are severely lacking, particularly ones that capture real-world speech disfluencies, speaker diversity, and multi-turn, cross-domain, speech-text multimodal annotations. Method: We introduce the first high-fidelity Chinese multimodal TOD dataset, comprising 5.4k dialogues (150 hours, 60k utterances) recorded from real interactions and meticulously aligned with textual transcriptions that preserve natural pauses, repetitions, and accent variations. We further propose a novel cross-modal chat protocol that enables dynamic switching between speech and text modalities during inference. Contribution/Results: The dataset enables the first systematic evaluation of Chinese spoken large language models on ASR with disfluent speech, speaker adaptation, and cross-domain generalization. Empirical results validate its effectiveness for robustness assessment, establishing a new benchmark resource and evaluation paradigm for Chinese spoken language modeling research.

📝 Abstract
In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.
Problem

Research questions and friction points this paper is trying to address.

Lack of Chinese speech-text dialogue datasets with real speech signals
Absence of speech disfluencies and speaker variations in existing datasets
Need for cross-modal interaction analysis in task-oriented dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese speech-text dual-modal dataset
Cross-modal chat task with dynamic switching
Evaluation of robustness and speaker sensitivity
Enzhi Wang
Nankai University
Machine learning, data mining, natural language processing
Qicheng Li
TMCC, College of Computer Science, Nankai University
Shiwan Zhao
Independent Researcher; formerly Research Scientist at IBM Research - China (2000-2020)
AGI, Large Language Model, NLP, Speech, Recommender System
Aobo Kong
Nankai University
NLP, LLM
Jiaming Zhou
TMCC, College of Computer Science, Nankai University
Xi Yang
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Yequan Wang
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Yonghua Lin
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Yong Qin
TMCC, College of Computer Science, Nankai University