Residual Cross-Modal Fusion Networks for Audio-Visual Navigation

📅 2026-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges in audio-visual embodied navigation arising from heterogeneous feature interactions, modality dominance, and information degradation, particularly in cross-domain scenarios. To overcome these issues, the authors propose the Cross-Modal Residual Fusion Network (CRFN), which employs a bidirectional residual interaction mechanism to enable fine-grained alignment and complementary modeling while preserving the independence of audio and visual representations. A training stabilization strategy is further introduced to enhance convergence and robustness. Experimental results demonstrate that CRFN significantly outperforms existing fusion approaches on the Replica and Matterport3D datasets, exhibiting superior cross-domain generalization. The study also reveals dynamic variations in the agent's reliance on individual modalities across different environments.

📝 Abstract
Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across datasets. This finding offers a new perspective on the cross-modal collaboration mechanisms of embodied agents.
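The core idea in the abstract, each modality refined by a residual signal projected from the other while its own representation passes through unchanged, can be sketched in a few lines. This is a minimal NumPy illustration of bidirectional residual interaction, not the authors' implementation: the projection matrices `W_av`, `W_va` and the `tanh` nonlinearity are illustrative assumptions.

```python
import numpy as np

def cross_modal_residual_fusion(audio, visual, W_av, W_va):
    """Bidirectional residual interaction (illustrative sketch).

    Each stream is refined by a projected signal from the other modality,
    while its residual path keeps the original representation intact --
    the property the abstract describes as preserving modality independence.
    W_av and W_va are hypothetical learned projection matrices.
    """
    audio_out = audio + np.tanh(visual @ W_va)   # visual cue injected into audio stream
    visual_out = visual + np.tanh(audio @ W_av)  # audio cue injected into visual stream
    return audio_out, visual_out

# Toy usage: a batch of 4 feature vectors per modality, dimension 16.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 16))
visual = rng.standard_normal((4, 16))
W_av = 0.1 * rng.standard_normal((16, 16))
W_va = 0.1 * rng.standard_normal((16, 16))
a_fused, v_fused = cross_modal_residual_fusion(audio, visual, W_av, W_va)
```

Because the cross-modal term is additive, zeroing the projections reduces each output to its input, which is what lets the residual path guard against single-modality dominance.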
Problem

Research questions and friction points this paper is trying to address.

audio-visual navigation
multimodal fusion
cross-modal interaction
embodied AI
cross-domain generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Residual Fusion
Audio-Visual Navigation
Bidirectional Residual Interaction
Multimodal Fusion
Embodied AI
Yi Wang
School of Computer Science and Technology, Xinjiang University, Urumqi, China
Yinfeng Yu
Associate Professor, Xinjiang University
Embodied intelligence
Bin Ren
School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China