End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions

📅 2025-02-10

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

This paper addresses the challenging problem of target speaker extraction in reverberant environments with overlapping speakers and directional noise, using multi-microphone arrays. We propose an end-to-end deep learning framework that jointly models multi-channel acoustic features and spatial cues. Crucially, we introduce instantaneous relative transfer functions (RTFs) as a novel spatial representation—replacing conventional direction-of-arrival (DOA) estimation and spectral embedding—and directly estimate time-varying RTFs from reference speech. This enables more precise spatial discrimination under realistic acoustic conditions. Evaluated on standard benchmarks, our method achieves a 3.2 dB improvement in SI-SNRi over DOA-based baselines and a 5.7 dB gain over spectral embedding baselines. These results demonstrate the effectiveness and robustness of instantaneous RTFs for speaker extraction in complex, reverberant, multi-speaker scenarios with directional interference.

Technology Category

Application Category

📝 Abstract

This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. In this work, we propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that using spatial cues yields better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.

Problem

Research questions and friction points this paper is trying to address.

Extracts desired speaker from multiple sources

Utilizes relative transfer functions for extraction

Compares RTF and DOA spatial cues effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses relative transfer functions

Compares RTF and DOA cues

Enhances speaker extraction performance

🔎 Similar Papers

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction