CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-Car Speech Separation with Distributed Heterogeneous Arrays

📅 2025-09-01
🤖 AI Summary
To address the degradation of automatic speech recognition (ASR) caused by overlapping speech in vehicles, this paper proposes CabinSep, a lightweight real-time speech separation method for distributed heterogeneous microphone arrays. It uses channel information to extract spatial features, which improves the estimation of speech and noise masks, and it trains on data augmented with a mix of simulated and real-recorded impulse responses (IRs), which improves speaker localization at zone boundaries. At inference, minimum variance distortionless response (MVDR) beamforming driven by the estimated masks reduces speech distortion. On a real-recorded in-vehicle dataset, the method achieves a 17.5% relative reduction in speech recognition error rate compared with DualSep, at a computational cost of only 0.4 GMACs.

📝 Abstract
Separating overlapping speech from multiple speakers is crucial for effective human-vehicle interaction. This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. Our contributions are threefold: First, we utilize channel information to extract spatial features, which improves the estimation of speech and noise masks. Second, we employ MVDR during inference, reducing speech distortion to make it more ASR-friendly. Third, we introduce a data augmentation method combining simulated and real-recorded impulse responses (IRs), improving speaker localization at zone boundaries and further reducing speech recognition errors. With a computational complexity of only 0.4 GMACs, CabinSep achieves a 17.5% relative reduction in speech recognition error rate on a real-recorded dataset compared to the state-of-the-art DualSep model. Demos are available at: https://cabinsep.github.io/cabinsep/.
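
To make the mask-based MVDR step concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which MVDR formulation CabinSep uses, so the sketch assumes the common reference-channel (Souden-style) form, and the function name `mask_mvdr`, the array layouts, and the `loading` parameter are all hypothetical. The masks would come from the neural network; here they are plain arrays.

```python
import numpy as np

def mask_mvdr(stft, speech_mask, noise_mask, ref_channel=0, loading=1e-6, eps=1e-8):
    """Mask-based MVDR sketch (assumed Souden-style formulation).

    stft:        complex array, shape (C, F, T) = channels, frequencies, frames
    speech_mask: real array in [0, 1], shape (F, T)
    noise_mask:  real array in [0, 1], shape (F, T)
    Returns the beamformed STFT, shape (F, T).
    """
    num_channels = stft.shape[0]
    x = stft.transpose(1, 0, 2)  # (F, C, T)

    # Mask-weighted spatial covariance matrices, one (C, C) matrix per frequency.
    phi_s = np.einsum("ft,fct,fdt->fcd", speech_mask, x, x.conj())
    phi_s = phi_s / (speech_mask.sum(axis=-1)[:, None, None] + eps)
    phi_n = np.einsum("ft,fct,fdt->fcd", noise_mask, x, x.conj())
    phi_n = phi_n / (noise_mask.sum(axis=-1)[:, None, None] + eps)

    # Diagonal loading keeps the noise covariance invertible.
    phi_n = phi_n + loading * np.eye(num_channels)

    # w(f) = (phi_n^{-1} phi_s) u_ref / trace(phi_n^{-1} phi_s)
    ratio = np.linalg.solve(phi_n, phi_s)             # (F, C, C)
    trace = np.trace(ratio, axis1=-2, axis2=-1)       # (F,)
    weights = ratio[:, :, ref_channel] / (trace[:, None] + eps)  # (F, C)

    # Apply y(f, t) = w(f)^H x(f, t).
    return np.einsum("fc,fct->ft", weights.conj(), x)
```

In this form the beamformer passes the target component of the reference channel unchanged while minimizing residual noise power, which is the "distortionless" property the abstract refers to when it describes the output as more ASR-friendly.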
Problem

Research questions and friction points this paper is trying to address.

Separating overlapping speech from multiple speakers
Reducing speech recognition errors in ASR models
Improving speaker localization at zone boundaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

MVDR beamforming with neural masks
IR-augmented data for localization
Lightweight real-time speech separation
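
The IR-based augmentation can be sketched as follows. This is a hedged illustration only: the listing does not specify how simulated and real-recorded IRs are mixed, so the sampling probability `p_real`, the function name `augment_with_ir`, and the `(channels, taps)` IR layout are assumptions made for the example.

```python
import numpy as np

def augment_with_ir(dry, sim_irs, real_irs, p_real=0.5, rng=None):
    """Spatialize a dry mono signal with a randomly chosen multi-channel IR.

    dry:      1-D array of dry (anechoic) speech samples
    sim_irs:  list of simulated IRs, each of shape (C, L)
    real_irs: list of real-recorded IRs, each of shape (C, L)
    p_real:   hypothetical probability of drawing from the real-recorded pool
    Returns an array of shape (C, len(dry)).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Draw from the real pool with probability p_real, else from the simulated pool.
    use_real = len(real_irs) > 0 and rng.random() < p_real
    pool = real_irs if use_real else sim_irs
    ir = pool[rng.integers(len(pool))]

    # Convolve the dry signal with each channel's IR and trim to the input length.
    return np.stack([np.convolve(dry, h)[: len(dry)] for h in ir])
```

Mixing both pools is one way to combine the coverage of simulated IRs with the realism of measured ones; the actual ratio and selection strategy used for CabinSep would need to be taken from the paper itself.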

Authors

Runduo Han
Dalian University of Technology
Yanxin Hu
Shanghai ZEEKR Blue New Energy Technology Co., Ltd., China
Yihui Fu
Technische Universität Braunschweig; Speech Processing
Zihan Zhang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Yukai Jv
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Li Chen
Shanghai ZEEKR Blue New Energy Technology Co., Ltd., China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China