🤖 AI Summary
This work addresses key challenges in audio-visual embodied navigation: modeling interactions between heterogeneous features, avoiding single-modality dominance, and preventing information degradation, particularly in cross-domain scenarios. To overcome these issues, the authors propose the Cross-modal Residual Fusion Network (CRFN), which employs a bidirectional residual interaction mechanism to enable fine-grained alignment and complementary modeling while preserving the independence of the audio and visual representations. A training stabilization strategy is further introduced to improve convergence and robustness. Experimental results demonstrate that CRFN significantly outperforms existing fusion approaches on the Replica and Matterport3D datasets and exhibits superior cross-domain generalization. The study also reveals that the agent's reliance on individual modalities varies dynamically across environments.
📝 Abstract
Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between the audio and visual streams to achieve complementary modeling and fine-grained alignment while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across datasets; this finding offers a new perspective on the cross-modal collaboration mechanisms of embodied agents.
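To make the fusion idea concrete, the sketch below illustrates one way a bidirectional residual interaction between audio and visual streams could be implemented. It is a minimal sketch under assumed design choices: the use of cross-attention as the interaction operator, the feature dimensions, and the layer normalization are all illustrative assumptions, not the paper's actual CRFN implementation.

```python
# Minimal sketch of a bidirectional residual cross-modal fusion block.
# Illustrative only: cross-attention, dimensions, and normalization are
# assumptions, not the exact design described in the paper.
import torch
import torch.nn as nn

class BidirectionalResidualFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Each stream queries the other modality for complementary cues.
        self.audio_queries_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_queries_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_visual = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, seq_len, dim) feature sequences.
        # Cross-modal messages are added back through residual connections,
        # so each stream keeps its own representation while absorbing
        # complementary information from the other modality.
        from_visual, _ = self.audio_queries_visual(query=audio, key=visual, value=visual)
        from_audio, _ = self.visual_queries_audio(query=visual, key=audio, value=audio)
        audio_out = self.norm_audio(audio + from_visual)    # residual keeps the audio stream intact
        visual_out = self.norm_visual(visual + from_audio)  # residual keeps the visual stream intact
        return audio_out, visual_out

# Usage: the fused streams would feed the navigation policy.
audio_feat = torch.randn(2, 16, 512)
visual_feat = torch.randn(2, 16, 512)
fusion = BidirectionalResidualFusion()
audio_out, visual_out = fusion(audio_feat, visual_feat)
```

The residual form is what distinguishes this from plain concatenation or gating: each modality's pathway is preserved end to end, and cross-modal information enters only as an additive correction, which is consistent with the paper's goal of complementary modeling without sacrificing representational independence.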