Semantic Audio-Visual Navigation in Continuous Environments

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations in existing audio-visual navigation methods, which rely on precomputed room impulse responses, restrict agent movement to discrete grids, and struggle with intermittent target silence leading to perceptual gaps. To overcome these challenges, the authors propose MAGNet, a novel model enabling semantic audio-visual navigation in continuous 3D environments. MAGNet allows agents to move freely while perceiving spatiotemporally consistent multimodal streams. Built upon a multimodal Transformer architecture, it fuses binaural audio, visual observations, egomotion cues, and historical trajectories, leveraging a memory-augmented mechanism to jointly model spatial and semantic target representations. Experimental results demonstrate that MAGNet outperforms the current state-of-the-art by 12.1% in success rate and exhibits significantly enhanced robustness and generalization under short-duration sound emissions and long-range navigation scenarios.

📝 Abstract
Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.
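The abstract describes fusing binaural audio, visual observations, and self-motion cues through a transformer while attending over stored history for memory-augmented goal reasoning. The paper does not specify the architecture in detail here, so the following is only a minimal illustrative sketch of that general pattern (per-modality tokens attending jointly over current observations and a memory bank); all token counts, dimensions, and the mean-pooled goal readout are assumptions, not MAGNet's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 32  # hypothetical embedding dimension

# stand-ins for per-modality token embeddings (all hypothetical)
audio  = rng.normal(size=(4, d))   # binaural-audio tokens
vision = rng.normal(size=(9, d))   # visual-patch tokens
ego    = rng.normal(size=(1, d))   # egomotion token
memory = rng.normal(size=(8, d))   # memory bank of past fused states

tokens = np.concatenate([audio, vision, ego], axis=0)     # (14, d)

# memory-augmented step: current tokens attend over both the
# current observation tokens and the stored memory bank, so goal
# evidence can persist even when the target falls silent
context = np.concatenate([tokens, memory], axis=0)        # (22, d)
fused = attention(tokens, context, context)               # (14, d)

# pooled goal representation for the policy head
goal_state = fused.mean(axis=0)                           # (d,)
```

In a real system the memory bank would be updated each step (e.g. by appending the fused state) rather than sampled randomly; the point here is only the attend-over-[observations; memory] structure that lets past goal cues inform the current decision.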
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Navigation
Continuous Environments
Intermittent Sound
Goal Information Loss
Semantic Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Audio-Visual Navigation
Memory-Augmented Reasoning
Multimodal Transformer
Spatially Coherent Audio Rendering
Semantic Goal Representation
Yichen Zeng
Wuhan University
Hebaixu Wang
Wuhan University
Meng Liu
Zhongguancun Academy
Yu Zhou
Nankai University
Chen Gao
Zhongguancun Academy
Kehan Chen
CASIA
Gongping Huang
Professor, Wuhan University, Wuhan, China
Acoustic Signal Processing · Microphone Arrays · Speech Enhancement · Noise Reduction