AI Summary
A critical gap exists in large-scale, multi-speaker, multilingual, and topically diverse conversational speech datasets. To address this, we introduce OleSpeech-IV, a high-quality multilingual conversational speech dataset explicitly designed for dialogue understanding and automatic speech recognition, covering English and Arabic and sourced from mainstream public podcasts, interviews, and conference recordings. We propose an end-to-end proprietary processing pipeline that integrates automatic speaker diarization, fine-grained transcription, timestamp alignment, confidence scoring, and rigorous human verification to achieve high-precision annotation of speaker roles, turn boundaries, and word-level text. We publicly release an open-source subset, OleSpeech-IV-2025-EN-AR-100, comprising 100 hours of meticulously annotated audio, under a non-commercial academic-use license. This dataset substantially fills a longstanding void in multilingual conversational speech benchmarks and provides a foundational resource for developing robust, speaker-aware, and linguistically diverse dialogue models.
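The annotations described above (speaker labels, turn boundaries, word-level timestamps, and confidence scores) naturally form a nested turn/word structure. The sketch below shows one plausible in-memory representation of such a record; the class and field names are illustrative assumptions, not the actual OleSpeech-IV release schema.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One word-level annotation: text, timestamps in seconds, confidence in [0, 1]."""
    text: str
    start: float
    end: float
    confidence: float

@dataclass
class Turn:
    """One speaker turn: a speaker label plus its word-level annotations."""
    speaker: str
    words: list

    @property
    def start(self) -> float:
        return self.words[0].start

    @property
    def transcript(self) -> str:
        return " ".join(w.text for w in self.words)

# Toy two-turn conversation illustrating the structure (values are made up).
conversation = [
    Turn("Host", [Word("Welcome", 0.00, 0.45, 0.98),
                  Word("back", 0.45, 0.70, 0.95)]),
    Turn("Guest", [Word("Thanks", 1.10, 1.50, 0.97)]),
]

# Confidence scores allow filtering, e.g. flagging low-confidence words for review.
low_conf = [w for t in conversation for w in t.words if w.confidence < 0.96]
```

Keeping timestamps and confidence at the word level, as in this sketch, lets downstream users derive turn boundaries and filter noisy annotations without re-running the pipeline.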
Abstract
The OleSpeech-IV dataset is a large-scale, multi-speaker, multilingual conversational speech dataset covering diverse topics. The audio content comes from publicly available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The "IV" denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.