CASPER: A Large Scale Spontaneous Speech Dataset

📅 2025-05-30
🤖 AI Summary
Spontaneous speech data is scarce, and most existing speech datasets are scripted, which limits the ability of large speech models to learn authentic human interaction patterns. To address this, we propose the first reproducible paradigm for constructing large-scale spontaneous speech datasets. It integrates context-guided dialogue elicitation protocols, structured multi-device synchronized recording procedures, and a fine-grained metadata annotation framework, enabling high-fidelity acquisition of over 200 hours of natural, multi-topic conversational speech. This paradigm overcomes conventional recording constraints and substantially improves authenticity, lexical and pragmatic diversity, and real-time interactivity. The released Stage 1 dataset, the first of its kind, fills a critical gap in high-quality spontaneous speech resources and serves as a foundational benchmark for automatic speech recognition, dialogue modeling, and the training of speech foundation models.

📝 Abstract
The success of large language models has driven interest in developing similar speech processing capabilities. However, a key challenge is the scarcity of high-quality spontaneous speech data, as most existing datasets contain scripted dialogues. To address this, we present a novel pipeline for eliciting and recording natural dialogues and release our Stage 1 dataset with 200+ hours of spontaneous speech. Our approach fosters fluid, natural conversations while encouraging a diverse range of topics and interactive exchanges. Unlike traditional methods, it facilitates genuine interactions, providing a reproducible framework for future data collection. This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data. We plan to expand this dataset in future stages, offering a growing resource for the research community.

Problem

Research questions and friction points this paper is trying to address.

Scarcity of high-quality spontaneous speech data
Need for natural dialogues in speech processing
Lack of diverse topic interactions in existing datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel pipeline for natural dialogue recording
Stage 1 dataset with 200+ hours of spontaneous speech
Reproducible framework for future data collection