End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions

📅 2025-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenging problem of target speaker extraction in reverberant environments with overlapping speakers and directional noise, using multi-microphone arrays. We propose an end-to-end deep learning framework that jointly models multi-channel acoustic features and spatial cues. Crucially, we introduce instantaneous relative transfer functions (RTFs) as a novel spatial representation—replacing conventional direction-of-arrival (DOA) estimation and spectral embedding—and directly estimate time-varying RTFs from reference speech. This enables more precise spatial discrimination under realistic acoustic conditions. Evaluated on standard benchmarks, our method achieves a 3.2 dB improvement in SI-SNRi over DOA-based baselines and a 5.7 dB gain over spectral embedding baselines. These results demonstrate the effectiveness and robustness of instantaneous RTFs for speaker extraction in complex, reverberant, multi-speaker scenarios with directional interference.

Technology Category

Application Category

📝 Abstract
This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. In this work, we propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that using spatial cues yields better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.
Problem

Research questions and friction points this paper is trying to address.

Extracts desired speaker from multiple sources
Utilizes relative transfer functions for extraction
Compares RTF and DOA spatial cues effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses relative transfer functions
Compares RTF and DOA cues
Enhances speaker extraction performance
🔎 Similar Papers