🤖 AI Summary
To address the lack of open-source, multi-channel, multi-speaker Mandarin speech datasets for in-vehicle scenarios, this work introduces the first publicly available ASR benchmark dataset for automotive environments—comprising over 100 hours of real-world driving recordings (4-channel far-field plus near-field microphones) and 40 hours of in-vehicle noise. Methodologically, the authors integrate multi-channel acquisition, blind source separation (BSS), speaker diarization, and end-to-end ASR, augmented with noise-based simulation for improved robustness. The key contributions are: (1) the first reproducible, publicly released multi-speaker in-vehicle speech dataset and evaluation benchmark; (2) a joint separation-and-recognition modeling framework that mitigates in-cabin reverberation, high levels of background noise, and overlapping speech; and (3) an empirical analysis showing significant performance degradation of mainstream ASR models in automotive settings, alongside an open-sourced baseline system enabling reproducible evaluation on both multi-speaker recognition and role-aware transcription tasks.
📝 Abstract
This paper presents AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 comprises two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios, consisting of four far-field speech signals captured by microphones mounted on each car door, together with near-field signals obtained from high-fidelity headset microphones worn by each speaker; and (2) 40 hours of real-world environmental noise recordings, which support in-car speech data simulation. We also provide an open-access, reproducible baseline system built on this dataset. The system features a speech frontend model that applies speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech recognition module that transcribes the content of each individual speaker. Experimental results demonstrate the challenges faced by various mainstream ASR models when evaluated on AISHELL-5. We believe the AISHELL-5 dataset will significantly advance research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark.
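The abstract notes that the 40 hours of environmental noise recordings support in-car speech data simulation. The paper does not specify the mixing recipe, but a common approach is to add recorded noise to clean speech at a target signal-to-noise ratio (SNR). The following is a minimal sketch of that step, assuming single-channel float waveforms at a common sample rate; the function name `mix_at_snr` and the SNR range are illustrative, not from the paper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add recorded noise to clean speech at a target SNR in dB.

    Assumes both signals are 1-D float arrays at the same sample rate.
    """
    # Tile or trim the noise recording to match the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose a gain g so that 10*log10(speech_power / (g^2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Illustrative usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter noise clip, will be tiled
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

In a real simulation pipeline the mixture would typically also be convolved with an in-car room impulse response before adding noise, so that the training data reflects cabin reverberation as well.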