🤖 AI Summary
This paper addresses the challenging problem of target speaker extraction in reverberant environments with overlapping speakers and directional noise, using multi-microphone arrays. We propose an end-to-end deep learning framework that jointly models multi-channel acoustic features and spatial cues. Crucially, we introduce instantaneous relative transfer functions (RTFs) as a novel spatial representation—replacing conventional direction-of-arrival (DOA) estimation and spectral embedding—and directly estimate time-varying RTFs from reference speech. This enables more precise spatial discrimination under realistic acoustic conditions. Evaluated on standard benchmarks, our method achieves a 3.2 dB improvement in SI-SNRi over DOA-based baselines and a 5.7 dB gain over spectral embedding baselines. These results demonstrate the effectiveness and robustness of instantaneous RTFs for speaker extraction in complex, reverberant, multi-speaker scenarios with directional interference.
📝 Abstract
This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. In this work, we propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that using spatial cues yields better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.