🤖 AI Summary
To address degraded automatic speech recognition (ASR) performance caused by overlapping speech in in-vehicle scenarios, this paper proposes a lightweight real-time speech separation method. It leverages a distributed heterogeneous microphone array, integrates channel-aware spatial feature extraction with hybrid (simulated + real-recorded) impulse response data augmentation to improve source localization and neural mask estimation accuracy, and applies MVDR beamforming during inference to reduce speech distortion. Key contributions include heterogeneous array modeling, the hybrid impulse response augmentation strategy, and the combination of neural mask estimation with MVDR beamforming. Experiments on a real in-vehicle recording dataset show that the proposed method achieves a 17.5% relative reduction in ASR word error rate compared to DualSep, at a computational cost of only 0.4 GMACs, significantly improving frontend robustness and real-time capability.
📝 Abstract
Separating overlapping speech from multiple speakers is crucial for effective human-vehicle interaction. This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. Our contributions are threefold: First, we utilize channel information to extract spatial features, which improves the estimation of speech and noise masks. Second, we employ MVDR during inference, reducing speech distortion to make the output more ASR-friendly. Third, we introduce a data augmentation method combining simulated and real-recorded impulse responses (IRs), improving speaker localization at zone boundaries and further reducing speech recognition errors. With a computational complexity of only 0.4 GMACs, CabinSep achieves a 17.5% relative reduction in speech recognition error rate on a real-recorded dataset compared to the state-of-the-art DualSep model. Demos are available at: https://cabinsep.github.io/cabinsep/.
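To make the mask-based MVDR step concrete, the sketch below shows the standard recipe the abstract alludes to: time-frequency masks weight the multichannel STFT to form speech and noise spatial covariance matrices, a steering vector is taken as the principal eigenvector of the speech covariance, and the MVDR weights w = R_n⁻¹h / (hᴴR_n⁻¹h) are applied per frequency bin. This is a generic illustration, not CabinSep's actual network or array geometry; all function and variable names here are hypothetical.

```python
import numpy as np

def mask_based_mvdr(X, speech_mask, noise_mask, eps=1e-8):
    """Generic mask-based MVDR beamformer (illustrative, not CabinSep itself).

    X           : complex multichannel STFT, shape (channels, freq, time)
    speech_mask : real mask in [0, 1], shape (freq, time)
    noise_mask  : real mask in [0, 1], shape (freq, time)
    Returns the beamformed single-channel STFT, shape (freq, time).
    """
    C, F, T = X.shape
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Xf = X[:, f, :]  # (C, T)
        # Mask-weighted spatial covariance matrices for speech and noise
        Rs = (speech_mask[f] * Xf) @ Xf.conj().T / (speech_mask[f].sum() + eps)
        Rn = (noise_mask[f] * Xf) @ Xf.conj().T / (noise_mask[f].sum() + eps)
        Rn = Rn + eps * np.eye(C)  # diagonal loading for numerical stability
        # Steering vector: principal eigenvector of the speech covariance
        _, vecs = np.linalg.eigh(Rs)
        h = vecs[:, -1]
        # MVDR weights: w = Rn^{-1} h / (h^H Rn^{-1} h)
        Rn_inv_h = np.linalg.solve(Rn, h)
        w = Rn_inv_h / (h.conj() @ Rn_inv_h + eps)
        Y[f] = w.conj() @ Xf  # distortionless response toward the speech source
    return Y
```

Because the beamformer is a linear spatial filter constrained to pass the target direction undistorted, it introduces far less spectral distortion than direct mask multiplication, which is why it tends to be friendlier to a downstream ASR model.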