AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of open-source, multi-channel, multi-speaker Mandarin speech datasets for in-vehicle scenarios, this work introduces the first publicly available ASR benchmark dataset for automotive environments, comprising over 100 hours of real-world driving recordings (four far-field door microphones plus near-field headset microphones) and 40 hours of in-vehicle noise. Methodologically, we integrate multi-channel acquisition, blind source separation (BSS), speaker diarization, and end-to-end ASR, augmented with noise-based simulation for enhanced robustness. Our key contributions are: (1) the first reproducible, publicly released multi-speaker in-vehicle speech dataset and evaluation benchmark; (2) a joint separation-and-recognition modeling framework that mitigates in-cabin reverberation, high levels of background noise, and overlapping speech; and (3) an empirical analysis demonstrating significant performance degradation of mainstream ASR models in automotive settings, alongside an open-sourced baseline system enabling reproducible evaluation on both multi-speaker recognition and role-aware transcription tasks.

📝 Abstract
This paper presents AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios, consisting of four far-field speech signals captured by microphones located on each car door, as well as near-field signals obtained from high-fidelity headset microphones worn by each speaker; and (2) 40 hours of real-world environmental noise recordings, which support in-car speech data simulation. Moreover, we provide an open-access, reproducible baseline system based on this dataset. The system features a speech frontend model that employs speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech recognition module that transcribes the content of each individual speaker. Experimental results demonstrate the challenges that various mainstream ASR models face when evaluated on AISHELL-5. We firmly believe the AISHELL-5 dataset will significantly advance research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark.
Problem

Research questions and friction points this paper is trying to address.

Creating first open-source in-car multi-speaker Mandarin ASR dataset
Addressing challenges in speech diarization and recognition in driving scenarios
Providing benchmark for ASR models in complex multi-channel environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-channel far-field and near-field speech recording
Speech source separation for clean speech extraction
Open-access reproducible baseline ASR system
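The separation-then-recognition baseline outlined above can be sketched as a two-stage pipeline: a frontend that maps the four far-field door-microphone channels to one estimated clean stream per speaker, followed by an ASR pass over each stream. This is a minimal illustration only; `separate_sources` and `transcribe` are hypothetical stubs standing in for the trained separation and recognition models of the released system.

```python
import numpy as np

def separate_sources(multichannel_audio: np.ndarray, num_speakers: int):
    """Stub blind-source-separation frontend: maps a (channels, samples)
    far-field recording to one estimated stream per speaker. A real
    frontend would run a trained separation network; here we simply
    average the channels and duplicate the result."""
    mono = multichannel_audio.mean(axis=0)
    return [mono.copy() for _ in range(num_speakers)]

def transcribe(stream: np.ndarray) -> str:
    """Stub ASR backend: a trained recognizer would return the transcript."""
    return f"<transcript of {stream.shape[0]} samples>"

def run_pipeline(far_field: np.ndarray, num_speakers: int):
    """Separation-then-recognition: one transcript per speaker."""
    streams = separate_sources(far_field, num_speakers)
    return [transcribe(s) for s in streams]

# Four far-field channels (one microphone per car door), 1 s at 16 kHz.
audio = np.random.randn(4, 16000)
print(run_pipeline(audio, num_speakers=2))
```

The key design point mirrored here is the decoupling of the frontend from the recognizer: the ASR module only ever sees single-speaker streams, so any mainstream ASR model can be plugged into `transcribe` for evaluation.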
Yuhang Dai
Audio, Speech and Language Processing Group(ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
He Wang
Audio, Speech and Language Processing Group(ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Xingchen Li
Audio, Speech and Language Processing Group(ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Zihan Zhang
Audio, Speech and Language Processing Group(ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Shuiyuan Wang
Audio, Speech and Language Processing Group(ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group(ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Xin Xu
Beijing AISHELL Technology Co., Ltd., Beijing, China
Hongxiao Guo
Beijing AISHELL Technology Co., Ltd., Beijing, China
Shaoji Zhang
Beijing AISHELL Technology Co., Ltd., Beijing, China
Hui Bu
Beijing AISHELL Technology Co., Ltd., Beijing, China
Wei Chen
Li Auto Inc., Beijing, China