AI Summary
A critical gap exists in large-scale, multi-speaker, multilingual, and topically diverse conversational speech datasets. To address this, we introduce OleSpeech-IV, a high-quality multilingual conversational speech dataset explicitly designed for dialogue understanding and automatic speech recognition, covering English and Arabic and sourced from mainstream public podcasts, interviews, and conference recordings. We propose an end-to-end proprietary processing pipeline that integrates automatic speaker diarization, fine-grained transcription, timestamp alignment, confidence scoring, and rigorous human verification to achieve high-precision annotation of speaker roles, turn boundaries, and word-level text. We publicly release an open-source subset, OleSpeech-IV-2025-EN-AR-100, comprising 100 hours of meticulously annotated audio, under a non-commercial academic-use license. This dataset substantially fills a longstanding void in multilingual conversational speech benchmarks and provides a foundational resource for developing robust, speaker-aware, and linguistically diverse dialogue models.
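The annotations described above (speaker labels, turn boundaries, word-level timestamps, and confidence scores) naturally form a nested turn/word structure. The sketch below shows one plausible in-memory representation of such a record; the class and field names are illustrative assumptions, not the actual OleSpeech-IV release schema.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One word-level annotation: text, timestamps in seconds, confidence in [0, 1]."""
    text: str
    start: float
    end: float
    confidence: float

@dataclass
class Turn:
    """One speaker turn: a speaker label plus its word-level annotations."""
    speaker: str
    words: list

    @property
    def start(self) -> float:
        return self.words[0].start

    @property
    def transcript(self) -> str:
        return " ".join(w.text for w in self.words)

# Toy two-turn conversation illustrating the structure (values are made up).
conversation = [
    Turn("Host", [Word("Welcome", 0.00, 0.45, 0.98),
                  Word("back", 0.45, 0.70, 0.95)]),
    Turn("Guest", [Word("Thanks", 1.10, 1.50, 0.97)]),
]

# Confidence scores allow filtering, e.g. flagging low-confidence words for review.
low_conf = [w for t in conversation for w in t.words if w.confidence < 0.96]
```

Keeping timestamps and confidence at the word level, as in this sketch, lets downstream users derive turn boundaries and filter noisy annotations without re-running the pipeline.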
Abstract
The OleSpeech-IV dataset is a large-scale, multi-speaker, multilingual conversational speech dataset covering diverse topics. The audio content comes from publicly available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The "IV" denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.