🤖 AI Summary
This work addresses the challenging problem of single-channel target speech extraction (TSE) in reverberant enclosures. We propose the first method that relies solely on the geometric distance between each speaker and the microphone, using neither speaker identity cues (e.g., speaker embeddings) nor physiological features (e.g., pitch), and thereby departs from conventional identity- or content-based modeling. Our approach fuses a distance encoding with time-frequency features in a lightweight, end-to-end differentiable network that can also jointly estimate the distances of multiple speakers in a mixture. Evaluated in both intra-room and cross-room scenarios, it significantly outperforms baseline methods, achieving a distance root-mean-square error (RMSE) of only 0.15 m, and supports a real-time online demo. This establishes a purely distance-driven paradigm for single-channel TSE and opens a pathway toward unsupervised and privacy-preserving speech separation.
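To make the fusion idea concrete, here is a minimal PyTorch sketch of conditioning time-frequency (TF) features on a scalar distance query via FiLM-style scale-and-shift. The module names, layer sizes, and the FiLM choice are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class DistanceFusionBlock(nn.Module):
    """Illustrative FiLM-style fusion of a scalar distance cue with TF
    features. All sizes and the conditioning scheme are placeholders,
    not the paper's actual network."""

    def __init__(self, n_freq: int = 257, hidden: int = 64):
        super().__init__()
        # Encode the queried target distance into per-frequency scale/shift.
        self.dist_encoder = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 2 * n_freq)
        )
        # A tiny TF backbone standing in for the separation network.
        self.backbone = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, spec: torch.Tensor, distance: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_freq, n_frames) magnitude spectrogram
        # distance: (batch, 1) queried distance in metres
        gamma, beta = self.dist_encoder(distance).chunk(2, dim=-1)
        # Broadcast the conditioning over time: scale and shift each freq bin.
        cond = gamma.unsqueeze(1).unsqueeze(-1) * spec + beta.unsqueeze(1).unsqueeze(-1)
        # Predict a TF mask for the speaker at the queried distance.
        return torch.sigmoid(self.backbone(cond)) * spec

if __name__ == "__main__":
    mix = torch.randn(2, 1, 257, 100).abs()  # batch of 2 mixture spectrograms
    dist = torch.tensor([[1.5], [0.8]])      # queried distances (m)
    print(DistanceFusionBlock()(mix, dist).shape)  # torch.Size([2, 1, 257, 100])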
📝 Abstract
This paper addresses single-channel target speech extraction (TSE) in enclosures using only distance information. It is the first work to perform single-channel TSE with distance cues alone, without relying on speaker physiological information. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. The method can also be used to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.
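Since the model also produces per-speaker distance estimates, evaluation needs a metric over unordered speaker sets. Below is a small sketch of a permutation-invariant distance RMSE; the permutation-invariant matching is an assumption here, as the paper may pair estimates with references differently:

```python
import itertools
import math

def distance_rmse(est: list[float], ref: list[float]) -> float:
    """Permutation-invariant RMSE (metres) between estimated and
    reference speaker distances. Matching by best permutation is an
    assumption, not necessarily the paper's evaluation protocol."""
    return min(
        math.sqrt(sum((e - r) ** 2 for e, r in zip(perm, ref)) / len(ref))
        for perm in itertools.permutations(est)
    )

# Example: two speakers, each estimate off by 0.15 m from ground truth.
print(distance_rmse([1.65, 0.90], [1.50, 1.05]))  # 0.15
```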