🤖 AI Summary
To address degraded automatic speech recognition (ASR) performance caused by overlapping speech in in-vehicle scenarios, this paper proposes a lightweight real-time speech separation method. It leverages a distributed heterogeneous microphone array, integrates channel-aware spatial feature extraction with hybrid (simulated + real-recorded) impulse response data augmentation to improve source localization and neural mask estimation accuracy, and applies MVDR beamforming during inference to reduce speech distortion. Key contributions include heterogeneous array modeling, the hybrid impulse response augmentation strategy, and the combination of neural mask estimation with MVDR beamforming. Experiments on a real in-vehicle recording dataset show that the proposed method achieves a 17.5% relative reduction in ASR word error rate compared to DualSep, at a computational cost of only 0.4 GMACs, significantly improving frontend robustness and real-time capability.
📝 Abstract
Separating overlapping speech from multiple speakers is crucial for effective human-vehicle interaction. This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. Our contributions are threefold: First, we utilize channel information to extract spatial features, which improves the estimation of speech and noise masks. Second, we employ MVDR during inference, reducing speech distortion to make the output more ASR-friendly. Third, we introduce a data augmentation method combining simulated and real-recorded impulse responses (IRs), improving speaker localization at zone boundaries and further reducing speech recognition errors. With a computational complexity of only 0.4 GMACs, CabinSep achieves a 17.5% relative reduction in speech recognition error rate on a real-recorded dataset compared to the state-of-the-art DualSep model. Demos are available at: https://cabinsep.github.io/cabinsep/.
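To make the mask-based MVDR step concrete, the sketch below shows the standard recipe the abstract alludes to: time-frequency masks weight the multichannel STFT to form speech and noise spatial covariance matrices, a steering vector is taken as the principal eigenvector of the speech covariance, and the MVDR weights w = R_n⁻¹h / (hᴴR_n⁻¹h) are applied per frequency bin. This is a generic illustration, not CabinSep's actual network or array geometry; all function and variable names here are hypothetical.

```python
import numpy as np

def mask_based_mvdr(X, speech_mask, noise_mask, eps=1e-8):
    """Generic mask-based MVDR beamformer (illustrative, not CabinSep itself).

    X           : complex multichannel STFT, shape (channels, freq, time)
    speech_mask : real mask in [0, 1], shape (freq, time)
    noise_mask  : real mask in [0, 1], shape (freq, time)
    Returns the beamformed single-channel STFT, shape (freq, time).
    """
    C, F, T = X.shape
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Xf = X[:, f, :]  # (C, T)
        # Mask-weighted spatial covariance matrices for speech and noise
        Rs = (speech_mask[f] * Xf) @ Xf.conj().T / (speech_mask[f].sum() + eps)
        Rn = (noise_mask[f] * Xf) @ Xf.conj().T / (noise_mask[f].sum() + eps)
        Rn = Rn + eps * np.eye(C)  # diagonal loading for numerical stability
        # Steering vector: principal eigenvector of the speech covariance
        _, vecs = np.linalg.eigh(Rs)
        h = vecs[:, -1]
        # MVDR weights: w = Rn^{-1} h / (h^H Rn^{-1} h)
        Rn_inv_h = np.linalg.solve(Rn, h)
        w = Rn_inv_h / (h.conj() @ Rn_inv_h + eps)
        Y[f] = w.conj() @ Xf  # distortionless response toward the speech source
    return Y
```

Because the beamformer is a linear spatial filter constrained to pass the target direction undistorted, it introduces far less spectral distortion than direct mask multiplication, which is why it tends to be friendlier to a downstream ASR model.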