Single-Channel Target Speech Extraction Utilizing Distance and Room Clues

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Single-channel target speech extraction (TSE) that relies only on source distance clues generalizes poorly across acoustic environments, because the direct-to-reverberation ratio (DRR) implied by a given distance changes with room dimensions and reverberation time. To address this, the paper proposes a framework that jointly encodes distance and room information: learnable embeddings for the source distance and for the room acoustics (room dimensions and reverberation time) condition a TSE model operating in the time-frequency domain. Experiments on both simulated and real recorded datasets demonstrate the feasibility of joint distance and room modeling. Demonstration materials are publicly available.

📝 Abstract

This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by utilizing distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task: distance implies the sound source's direct-to-reverberation ratio (DRR) and can thus be exploited by speech separation and TSE systems. However, such a distance clue is significantly influenced by the room's acoustic characteristics, such as dimensions and reverberation time, making it challenging for TSE systems that rely solely on distance clues to generalize across different rooms. To solve this, we suggest providing room environmental information (room dimensions and reverberation time) to distance-based TSE for better generalization. Specifically, we propose a distance- and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embeddings. Results on both simulated and real recorded datasets demonstrate its feasibility. Demonstration materials are available at https://runwushi.github.io/distance-room-demo-page/.
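
To see why a distance clue alone is ambiguous, the sketch below estimates the DRR at a fixed source distance in two rooms. It is not taken from the paper. It assumes the textbook diffuse-field approximation for an omnidirectional source, where the critical distance is roughly 0.057 * sqrt(V / RT60) and the DRR in dB is 20 * log10(critical distance / source distance). The function names and the two room configurations are hypothetical.

```python
import math


def critical_distance(volume_m3: float, rt60_s: float, directivity: float = 1.0) -> float:
    """Distance in metres at which direct and reverberant energy are equal.

    Diffuse-field approximation; an illustrative assumption, not the paper's model.
    """
    return 0.057 * math.sqrt(directivity * volume_m3 / rt60_s)


def drr_db(distance_m: float, volume_m3: float, rt60_s: float) -> float:
    """Approximate DRR in dB for an omnidirectional source at distance_m."""
    d_c = critical_distance(volume_m3, rt60_s)
    return 20.0 * math.log10(d_c / distance_m)


# Two hypothetical rooms with the same 1.5 m source distance.
small_live_room = drr_db(1.5, volume_m3=4 * 3 * 2.5, rt60_s=0.8)
large_dry_room = drr_db(1.5, volume_m3=10 * 8 * 3, rt60_s=0.3)

print(f"small, reverberant room: {small_live_room:.1f} dB")
print(f"large, dry room:         {large_dry_room:.1f} dB")
```

Under these assumptions the same distance yields a clearly lower DRR in the small reverberant room than in the large dry one, which is the ambiguity that room dimensions and reverberation time are meant to resolve.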

Problem

Research questions and friction points this paper is trying to address.

Achieve single-channel target speech extraction using distance clues

Improve generalization across rooms by incorporating room information

Propose learnable distance and room embeddings for better performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes distance clues for speech extraction

Incorporates room environmental information

Proposes learnable distance and room embeddings
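
The points above can be illustrated with a minimal NumPy sketch of how distance and room clues might be embedded and injected into time-frequency features. The abstract does not describe the network, so everything here is an assumption: the `ClueEmbedding` class, the embedding size, and the feature-wise scale-and-shift (FiLM-style) fusion in `modulate` are hypothetical stand-ins, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)


class ClueEmbedding:
    """Two-layer MLP mapping scalar clues to an embedding vector.

    Weights are random here; in a real model they would be learned.
    """

    def __init__(self, in_dim: int, emb_dim: int):
        self.w1 = rng.standard_normal((in_dim, emb_dim)) * 0.1
        self.b1 = np.zeros(emb_dim)
        self.w2 = rng.standard_normal((emb_dim, emb_dim)) * 0.1
        self.b2 = np.zeros(emb_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        hidden = np.tanh(x @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def modulate(features, cond, w_scale, w_shift):
    """Scale and shift each channel of (frames, bins, channels) features.

    FiLM-style fusion is an assumed choice, not the paper's stated mechanism.
    """
    scale = cond @ w_scale  # shape: (channels,)
    shift = cond @ w_shift  # shape: (channels,)
    return features * (1.0 + scale) + shift


# Hypothetical clues: target distance (m) and room length, width, height (m), RT60 (s).
distance = np.array([1.5])
room = np.array([4.0, 3.0, 2.5, 0.8])

dist_emb = ClueEmbedding(in_dim=1, emb_dim=16)
room_emb = ClueEmbedding(in_dim=4, emb_dim=16)
cond = np.concatenate([dist_emb(distance), room_emb(room)])  # shape: (32,)

tf_features = rng.standard_normal((100, 65, 8))  # frames x frequency bins x channels
w_scale = rng.standard_normal((32, 8)) * 0.1
w_shift = rng.standard_normal((32, 8)) * 0.1
conditioned = modulate(tf_features, cond, w_scale, w_shift)
print(conditioned.shape)
```

Keeping the two embeddings separate before concatenation makes it easy to drop the room branch and compare against a distance-only baseline, which mirrors the comparison the abstract motivates.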
