Directional Source Separation for Robust Speech Recognition on Smart Glasses

πŸ“… 2023-09-20
πŸ›οΈ IEEE International Conference on Acoustics, Speech, and Signal Processing
πŸ“ˆ Citations: 5
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Speech recognition and speaker change detection in smart glasses degrade significantly in noisy environments. Method: This paper proposes a multi-microphone directional speech enhancement method tailored for wearable devices, jointly modeling neural beamforming and multichannel source separation within an end-to-end optimized framework integrating separation and automatic speech recognition (ASR). The approach employs a lightweight Conv-TasNet variant and a differentiable beamformer to enable efficient directional source separation on resource-constrained edge devices. Contribution/Results: Directional separation alone reduces the word error rate (WER) of wearer’s speech by 32%. Joint training further enhances robustness, achieving state-of-the-art ASR performance for smart glasses under real-world noise conditions. Crucially, this work presents the first empirical validation of end-to-end joint optimization of source separation and ASR for wearable audio applications.
πŸ“ Abstract
Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using the multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatic learning directional characteristics effectively improves separation quality. We further compare the ASR performance leveraging separated outputs to noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform the joint training of the directional source separation and ASR model, achieving the best overall ASR performance.
Problem

Research questions and friction points this paper is trying to address.

Improve speech recognition in noisy environments
Enhance directional source separation using multi-microphone arrays
Optimize joint training of separation and ASR models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-microphone array for directional source separation
Neural beamforming to learn directional characteristics
Joint training of separation and ASR models
πŸ”Ž Similar Papers
No similar papers found.