Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-visual segmentation (AVS) suffers from an inter-modal frequency-domain mismatch: audio high-frequency components are noise-prone, whereas visual high-frequency bands encode structural details. Existing methods overlook this discrepancy, compromising robustness. This paper proposes a frequency-aware AVS framework that reformulates segmentation as a frequency-domain decomposition and collaborative recomposition problem. Key contributions include: (1) a frequency-enhanced decomposition module that achieves modality-specific feature disentanglement via residual iterative refinement; and (2) a collaborative cross-modal consistency module leveraging a mixture-of-experts architecture with dynamic routing to jointly model audio-visual frequency characteristics and enforce semantic alignment. The framework achieves state-of-the-art performance on three standard benchmarks. Qualitative analysis demonstrates superior robustness against acoustic noise and visual occlusion, significantly improving segmentation accuracy and generalization in complex, real-world scenarios.

📝 Abstract
Audio-visual segmentation (AVS) plays a critical role in multimodal machine learning by effectively integrating audio and visual cues to precisely segment objects or regions within visual scenes. Recent AVS methods have demonstrated significant improvements. However, they overlook the inherent frequency-domain contradictions between audio and visual modalities: the pervasively interfering noise in audio high-frequency signals versus the structurally rich details in visual high-frequency signals. Ignoring these differences can result in suboptimal performance. In this paper, we rethink the AVS task from a deeper perspective by reformulating it as a frequency-domain decomposition and recomposition problem. To this end, we introduce a novel Frequency-Aware Audio-Visual Segmentation (FAVS) framework consisting of two key modules: a Frequency-Domain Enhanced Decomposer (FDED) module and a Synergistic Cross-Modal Consistency (SCMC) module. The FDED module employs residual-based iterative frequency decomposition to discriminate modality-specific semantics and structural features, and the SCMC module leverages a mixture-of-experts architecture to reinforce semantic consistency and modality-specific feature preservation through dynamic expert routing. Extensive experiments demonstrate that our FAVS framework achieves state-of-the-art performance on three benchmark datasets, and abundant qualitative visualizations further verify the effectiveness of the proposed FDED and SCMC modules. The code will be released as open source upon acceptance of the paper.
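The residual-based iterative frequency decomposition that the abstract attributes to the FDED module can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes FFT-based low-pass filtering on a 2-D feature map and a cutoff that halves at each iteration, both of which the paper may realize differently (e.g., with learned filters).

```python
import numpy as np

def iterative_frequency_decompose(x, num_bands=3, cutoff=0.5):
    """Sketch of residual-based iterative frequency decomposition:
    low-pass filter the input, keep the residual as the current
    high-frequency band, then recurse on the low-pass part with a
    progressively smaller cutoff (all details here are illustrative)."""
    h, w = x.shape
    # Normalized frequency radius: 1.0 corresponds to the Nyquist limit.
    fy = np.fft.fftfreq(h)[:, None] / 0.5
    fx = np.fft.fftfreq(w)[None, :] / 0.5
    radius = np.sqrt(fy ** 2 + fx ** 2)

    bands = []
    current = np.asarray(x, dtype=np.float64)
    for i in range(num_bands - 1):
        mask = radius <= cutoff / (2 ** i)  # halve the passband each step
        low = np.real(np.fft.ifft2(np.fft.fft2(current) * mask))
        bands.append(current - low)  # residual = high-frequency band
        current = low                # iterate on the remaining low band
    bands.append(current)
    return bands  # highest-to-lowest bands; their sum reconstructs x
```

Because each band is the exact residual of the previous low-pass step, summing all bands reconstructs the input, which is what makes a "decompose, then recompose" formulation lossless at the feature level.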
Problem

Research questions and friction points this paper is trying to address.

Addressing frequency-domain contradictions between audio and visual modalities
Improving audio-visual segmentation by decomposing and recomposing frequency features
Reducing performance degradation from audio noise and visual detail mismatches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses residual-based iterative frequency decomposition
Employs mixture-of-experts for cross-modal consistency
Reformulates segmentation as frequency decomposition-recomposition problem
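The mixture-of-experts idea behind the SCMC module's dynamic expert routing can be sketched as follows. The gating network, expert shapes, and top-k combination rule here are hypothetical stand-ins for the paper's (unpublished) architecture: a gate scores all experts for a given feature, and the top-k experts' outputs are blended with renormalized gate probabilities.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_route(x, expert_weights, gate_weights, top_k=2):
    """Illustrative top-k mixture-of-experts routing: score all experts
    with a linear gate, keep the top-k, and combine their outputs with
    renormalized gate probabilities (shapes are assumptions)."""
    probs = softmax(x @ gate_weights)            # (num_experts,)
    top = np.argsort(probs)[::-1][:top_k]        # indices of top-k experts
    weights = probs[top] / probs[top].sum()      # renormalize over top-k
    outputs = np.stack([x @ expert_weights[i] for i in top])
    return (weights[:, None] * outputs).sum(axis=0)
```

With `top_k` equal to the number of experts this reduces to a dense mixture; smaller `top_k` gives the sparse, input-dependent specialization that lets different experts handle, e.g., noisy audio bands versus detailed visual bands.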
🔎 Similar Papers
2024-07-18 · IEEE Workshop/Winter Conference on Applications of Computer Vision · Citations: 0
Yunzhe Shen
Dalian University of Technology, Dalian 116024, China
Kai Peng
Associate Professor, IEEE Senior Member, CCF Senior Member, Huaqiao University, China
Service Computing · Mobile Edge Computing · Computation Offloading
Leiye Liu
Dalian University of Technology, Dalian 116024, China
Wei Ji
School of Medicine, Yale University, New Haven, CT 06520 USA
Jingjing Li
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T5V 1A4, Canada
Miao Zhang
Dalian University of Technology, Dalian 116024, China
Yongri Piao
Dalian University of Technology, Dalian 116024, China
Huchuan Lu
Dalian University of Technology, Dalian 116024, China