🤖 AI Summary
To address the lack of open-source, multi-channel, multi-speaker Mandarin speech datasets for in-vehicle scenarios, this work introduces the first publicly available ASR benchmark dataset for automotive environments—comprising over 100 hours of real-world driving recordings (4-channel far-field plus near-field microphones) and 40 hours of in-vehicle noise. Methodologically, the authors integrate multi-channel acquisition, blind source separation (BSS), speaker diarization, and end-to-end ASR, augmented with noise-based simulation for improved robustness. The key contributions are: (1) the first reproducible, publicly released multi-speaker in-vehicle speech dataset and evaluation benchmark; (2) a joint separation-and-recognition modeling framework that mitigates in-cabin reverberation, high levels of background noise, and overlapping speech; and (3) an empirical analysis showing significant performance degradation of mainstream ASR models in automotive settings, alongside an open-sourced baseline system enabling reproducible evaluation on both multi-speaker recognition and role-aware transcription tasks.
📝 Abstract
This paper presents AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 comprises two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios, consisting of four far-field speech signals captured by microphones mounted on each car door, together with near-field signals obtained from high-fidelity headset microphones worn by each speaker; and (2) 40 hours of real-world environmental noise recordings, which support in-car speech data simulation. We also provide an open-access, reproducible baseline system built on this dataset. The system features a speech frontend model that applies speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech recognition module that transcribes the content of each individual speaker. Experimental results demonstrate the challenges faced by various mainstream ASR models when evaluated on AISHELL-5. We believe the AISHELL-5 dataset will significantly advance research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark.
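The abstract notes that the 40 hours of environmental noise recordings support in-car speech data simulation. The paper does not specify the mixing recipe, but a common approach is to add recorded noise to clean speech at a target signal-to-noise ratio (SNR). The following is a minimal sketch of that step, assuming single-channel float waveforms at a common sample rate; the function name `mix_at_snr` and the SNR range are illustrative, not from the paper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add recorded noise to clean speech at a target SNR in dB.

    Assumes both signals are 1-D float arrays at the same sample rate.
    """
    # Tile or trim the noise recording to match the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose a gain g so that 10*log10(speech_power / (g^2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Illustrative usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter noise clip, will be tiled
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

In a real simulation pipeline the mixture would typically also be convolved with an in-car room impulse response before adding noise, so that the training data reflects cabin reverberation as well.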