OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics

πŸ“… 2025-09-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
A critical gap exists in large-scale, multi-speaker, multilingual, and topically diverse conversational speech datasets. To address this, we introduce OleSpeech-IVβ€”the first high-quality, multilingual conversational speech dataset explicitly designed for dialogue understanding and automatic speech recognition, covering English and Arabic and sourced from mainstream public podcasts, interviews, and conference recordings. We propose an end-to-end proprietary processing pipeline integrating automatic speaker diarization, fine-grained transcription, timestamp alignment, confidence scoring, and rigorous human verification to achieve high-precision annotation of speaker roles, turn boundaries, and word-level text. We publicly release the open-source subset OleSpeech-IV-2025-EN-AR-100, comprising 100 hours of meticulously annotated audio, under a non-commercial academic use license. This dataset substantially fills a longstanding void in multilingual conversational speech benchmarks and provides a foundational resource for developing robust, speaker-aware, and linguistically diverse dialogue models.
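The summary above describes records annotated with speaker roles, turn boundaries, word-level text, timestamps, and confidence scores. A minimal sketch of how such a record might be modeled and filtered is shown below; the field names, class, and the 0.9 threshold are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """One hypothetical speaker turn in a conversational recording."""
    speaker: str      # human-sourced speaker name or role
    start: float      # turn start time in seconds (pipeline-derived)
    end: float        # turn end time in seconds
    text: str         # human-verified transcript
    confidence: float # pipeline-assigned confidence score

def filter_low_confidence(utterances, threshold=0.9):
    """Keep only turns whose confidence meets the threshold,
    e.g. to select high-precision segments for training."""
    return [u for u in utterances if u.confidence >= threshold]

turns = [
    Utterance("Host", 0.0, 4.2, "Welcome back to the show.", 0.97),
    Utterance("Guest", 4.5, 9.1, "Thanks for having me.", 0.82),
]
print([u.speaker for u in filter_low_confidence(turns)])  # -> ['Host']
```

Confidence-based filtering of this kind is one plausible use of the per-segment scores the pipeline emits; the paper itself does not prescribe a threshold.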

πŸ“ Abstract
The OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The "IV" denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.
Problem

Research questions and friction points this paper is trying to address.

Creating a large-scale multispeaker multilingual conversational dataset
Providing diverse topic coverage from various speech sources
Ensuring high-quality speaker and transcript annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multispeaker multilingual conversational dataset
Pipeline for refining human-sourced speaker names, turns, and transcripts
Open-sourced subset for non-commercial research use
Wei Chu
Olewave, San Francisco, USA
Yuanzhe Dong
Stanford University
NLP
Ke Tan
Research Scientist, Meta Reality Labs
Speech Enhancement, Speech Separation, Microphone Array Processing, Model Compression, Deep Learning
Dong Han
Olewave, San Francisco, USA
Xavier Menendez-Pidal
Olewave, San Francisco, USA
Ruchao Fan
Microsoft
Chenfeng Miao
PingAn Technology
Chanwoo Kim
Korea University
Bhiksha Raj
Carnegie Mellon University
Deep Learning, Artificial Intelligence, Speech and Audio Processing, Signal Processing, Machine Learning
Rita Singh
Carnegie Mellon University