From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual speech separation methods predominantly rely on static visual features, failing to fully exploit the cross-modal complementarity between lip motion and dynamic facial cues. To address this, we propose a coarse-to-fine recursive semantic enhancement framework that, for the first time, integrates speaker-aware cross-modal fusion with a multi-range spectro-temporal separation network, enabling dynamic, fine-grained audio-visual semantic alignment and iterative refinement. Our method adopts a two-stage architecture: the coarse stage leverages audio-visual speech recognition priors to initialize separation; the fine stage jointly models time-frequency and visual representations via a speaker-aware perceptual fusion module and a multi-range spectro-temporal network. Evaluated on three mainstream benchmarks and two noisy datasets, our approach achieves state-of-the-art performance, significantly improving separation accuracy and robustness in complex acoustic environments, thereby validating the effectiveness of the recursive enhancement mechanism.

📝 Abstract
Audio-visual speech separation aims to isolate each speaker's clean voice from mixtures by leveraging visual cues such as lip movements and facial features. While visual information provides complementary semantic guidance, existing methods often underexploit its potential by relying on static visual representations. In this paper, we propose CSFNet, a Coarse-to-Separate-Fine Network that introduces a recursive semantic enhancement paradigm for more effective separation. CSFNet operates in two stages: (1) Coarse Separation, where a first-pass estimation reconstructs a coarse audio waveform from the mixture and visual input; and (2) Fine Separation, where the coarse audio is fed back into an audio-visual speech recognition (AVSR) model together with the visual stream. This recursive process produces more discriminative semantic representations, which are then used to extract refined audio. To further exploit these semantics, we design a speaker-aware perceptual fusion block to encode speaker identity across modalities, and a multi-range spectro-temporal separation network to capture both local and global time-frequency patterns. Extensive experiments on three benchmark datasets and two noisy datasets show that CSFNet achieves state-of-the-art (SOTA) performance, with substantial coarse-to-fine improvements, validating the necessity and effectiveness of our recursive semantic enhancement framework.
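The two-stage recursion the abstract describes can be sketched in a few lines. This is a minimal illustrative skeleton, not the paper's implementation: the function names (`coarse_separator`, `avsr_semantics`, `fine_separator`, `csfnet_separate`) and the placeholder arithmetic inside them are hypothetical stand-ins for the learned networks CSFNet actually uses.

```python
import numpy as np

# Hypothetical stand-ins for the learned components named in the abstract;
# in the real CSFNet each of these is a neural network.
def coarse_separator(mixture, visual_feats):
    """Stage 1: first-pass estimate of the target waveform (placeholder)."""
    return mixture * 0.5

def avsr_semantics(coarse_audio, visual_feats):
    """AVSR model re-encodes coarse audio + the visual stream into
    more discriminative semantic features (placeholder concatenation)."""
    return np.concatenate([coarse_audio[:4], visual_feats[:4]])

def fine_separator(mixture, semantics):
    """Stage 2: refined extraction conditioned on enhanced semantics
    (placeholder)."""
    return mixture * 0.9

def csfnet_separate(mixture, visual_feats, n_refine=1):
    # Coarse separation from the mixture and the (static) visual input.
    audio = coarse_separator(mixture, visual_feats)
    # Recursive semantic enhancement: feed the coarse audio back through
    # the AVSR model together with the visual stream, then re-separate.
    for _ in range(n_refine):
        sem = avsr_semantics(audio, visual_feats)
        audio = fine_separator(mixture, sem)
    return audio
```

The point of the loop is that each pass conditions the separator on semantics derived from progressively cleaner audio, which is the "recursive semantic enhancement" the paper claims drives its coarse-to-fine gains.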
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech separation using recursive audio-visual semantic refinement
Improving speaker isolation with coarse-to-fine visual semantic guidance
Addressing underutilized visual cues in audio-visual speech separation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive audio-visual semantic enhancement paradigm
Speaker-aware perceptual fusion across modalities
Multi-range spectro-temporal separation network
Ke Xue
Nanjing University
Black-Box Optimization, Machine Learning
Rongfei Fan
Beijing Institute of Technology
Federated Learning, Edge Computing, Resource Allocation, Statistical Signal Processing
Lixin
Qilu University of Technology, Jinan 250353, China
Dawei Zhao
Shandong Computer Science Center, Jinan 250353, China
Chao Zhu
School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Han Hu
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China