Audio-Guided Visual Perception for Audio-Visual Navigation

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual navigation methods suffer from poor generalization to unseen sound sources or novel environments, resulting in low success rates and inefficient, overly long trajectories. This limitation stems primarily from the absence of explicit cross-modal alignment between auditory signals and visual regions, causing policies to rely on spurious acoustic-fingerprint–scene correlations and engage in blind exploration. To address this, we propose an audio-guided visual perception framework that transforms sounds from memorizable acoustic fingerprints into spatially informative guidance signals. Specifically, we model global auditory context via audio self-attention and introduce a cross-modal attention mechanism that dynamically modulates visual features and reweights visual regions conditioned on auditory context. This design enforces interpretable, context-aware cross-modal alignment. Experiments demonstrate substantial improvements in navigation success rate and path efficiency, alongside strong generalization across unseen scenes and previously unencountered sound sources.

📝 Abstract
Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current audio-visual navigation (AVN) methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious "acoustic fingerprint-scenario" correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.
Problem

Research questions and friction points this paper is trying to address.

Improving agent navigation to unheard sound sources in 3D environments
Addressing poor cross-source generalization in audio-visual navigation systems
Reducing dependency on memorized acoustic fingerprints for navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio self-attention extracts global auditory context
Auditory context guides visual feature attention spatially
Cross-modal alignment reduces acoustic fingerprint dependency
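The two-stage mechanism listed above (audio self-attention yielding a global auditory context, which then serves as the query in a cross-modal attention that reweights visual regions) can be sketched in a few lines. This is a minimal single-head NumPy illustration under assumed toy dimensions; the actual AGVP architecture, pooling choice, and feature sizes are not specified here and all names are placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; returns output and attention weights.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = softmax(scores, axis=-1)
    return w @ v, w

rng = np.random.default_rng(0)
d = 8                               # feature dimension (illustrative)
audio = rng.standard_normal((4, d))    # 4 audio tokens (e.g. spectrogram patches)
visual = rng.standard_normal((16, d))  # 16 visual regions (e.g. a 4x4 feature map)

# 1) Audio self-attention -> contextualized audio tokens,
#    pooled into one global auditory context vector.
ctx, _ = attention(audio, audio, audio)
query = ctx.mean(axis=0, keepdims=True)        # shape (1, d)

# 2) Cross-modal attention: the auditory context queries visual regions,
#    producing a weight per region (sums to 1 over regions).
attended, region_weights = attention(query, visual, visual)

# 3) Region reweighting: modulate visual features by their attention weight,
#    highlighting sound-source-related regions at the feature level.
modulated = visual * region_weights.T          # shape (16, d)
```

The key property is that `region_weights` depends on the sound, so the same visual scene is emphasized differently for different sources, replacing memorized acoustic fingerprints with spatial guidance.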
Yi Wang
School of Computer Science and Technology, Xinjiang University, Urumqi, China
Yinfeng Yu
Associate Professor, Xinjiang University (Embodied intelligence)
Fuchun Sun
Tsinghua University, Beijing, China
Liejun Wang
School of Computer Science and Technology, Xinjiang University, Urumqi, China
Wendong Zheng
Tianjin University of Technology, Tianjin, China