🤖 AI Summary
In guided source separation (GSS)-based far-field speech enhancement, the choice of reference microphone strongly affects both output signal quality and downstream automatic speech recognition (ASR) performance. Conventional selection relies on the signal-to-noise ratio (SNR), which is well suited to noise reduction but ignores differences in the early-to-late reverberation ratio (ELR) across microphones, leading to insufficient reverberation suppression. To address this, the authors propose reference microphone selection criteria based on the normalized ℓ<sub>p</sub> norm, used either on its own or jointly with the SNR, so that both the noise and reverberation characteristics of the microphone signals are taken into account. Embedded in a CHiME-8 distant-ASR frontend, the proposed methods outperform the SNR-based baseline, reducing the macro-averaged word error rate (WER) under realistic far-field conditions.
📝 Abstract
Guided Source Separation (GSS) is a popular front-end for distant automatic speech recognition (ASR) systems using spatially distributed microphones. With such distributed arrays, the choice of reference microphone can have a large influence on the quality of the output signal and the downstream ASR performance. In GSS-based speech enhancement, reference microphone selection is typically performed using the signal-to-noise ratio (SNR), which is optimal for noise reduction but may neglect differences in the early-to-late-reverberant ratio (ELR) across microphones. In this paper, we propose two reference microphone selection methods for GSS-based speech enhancement based on the normalized $\ell_p$-norm: one using only the normalized $\ell_p$-norm, and one combining the normalized $\ell_p$-norm with the SNR to account for differences in both SNR and ELR across microphones. Experimental evaluation using a CHiME-8 distant ASR system shows that the proposed $\ell_p$-norm-based methods outperform the SNR-based baseline, reducing the macro-average word error rate.
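To make the selection criterion concrete, the sketch below shows one plausible way to score microphones by a normalized $\ell_p$-norm of their STFT magnitudes and optionally combine that score with a per-microphone SNR estimate. The exact normalization, the value of $p$, and the weighting `alpha` are not specified in the abstract, so they are assumptions here, not the paper's definitions; the key idea is that for $p < 2$ a sparser (less reverberant) magnitude spectrogram yields a smaller normalized $\ell_p$-norm.

```python
import numpy as np

def normalized_lp_norm(stft_mag, p=0.5):
    """Sparsity score of a magnitude spectrogram.

    Assumed form: ell_p norm divided by the ell_2 norm. For p < 2,
    sparser (typically less reverberant) spectrograms give SMALLER
    values; a flat spectrogram gives the largest value.
    """
    x = np.abs(stft_mag).ravel()
    lp = np.sum(x ** p) ** (1.0 / p)
    l2 = np.linalg.norm(x)
    return lp / (l2 + 1e-12)

def select_reference_mic(stft_mags, snrs=None, alpha=0.5, p=0.5):
    """Pick a reference microphone index.

    stft_mags: list of per-microphone magnitude spectrograms.
    snrs: optional per-microphone SNR estimates; when given, the
    (negated) sparsity score and the SNR are mixed with a weight
    `alpha` (a placeholder -- in practice the two quantities would
    need consistent scaling).
    """
    # Negate so that sparser signals (smaller norm) score higher.
    scores = np.array([-normalized_lp_norm(m, p) for m in stft_mags])
    if snrs is not None:
        scores = alpha * scores + (1.0 - alpha) * np.asarray(snrs, dtype=float)
    return int(np.argmax(scores))
```

A microphone close to the talker sees a spectrogram dominated by a few strong direct-path components, while a distant, reverberant microphone sees energy smeared across many time-frequency bins; the normalized $\ell_p$-norm separates these two cases even when their broadband SNRs are similar.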