Semantic Audio-Visual Navigation in Continuous Environments

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations in existing audio-visual navigation methods, which rely on precomputed room impulse responses, restrict agent movement to discrete grids, and struggle with intermittent target silence leading to perceptual gaps. To overcome these challenges, the authors propose MAGNet, a novel model enabling semantic audio-visual navigation in continuous 3D environments. MAGNet allows agents to move freely while perceiving spatiotemporally consistent multimodal streams. Built upon a multimodal Transformer architecture, it fuses binaural audio, visual observations, egomotion cues, and historical trajectories, leveraging a memory-augmented mechanism to jointly model spatial and semantic target representations. Experimental results demonstrate that MAGNet outperforms the current state-of-the-art by 12.1% in success rate and exhibits significantly enhanced robustness and generalization under short-duration sound emissions and long-range navigation scenarios.

📝 Abstract
Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.
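The abstract describes fusing binaural audio, visual observations, and self-motion cues through a transformer while attending over stored history for memory-augmented goal reasoning. The paper does not specify the architecture in detail here, so the following is only a minimal illustrative sketch of that general pattern (per-modality tokens attending jointly over current observations and a memory bank); all token counts, dimensions, and the mean-pooled goal readout are assumptions, not MAGNet's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 32  # hypothetical embedding dimension

# stand-ins for per-modality token embeddings (all hypothetical)
audio  = rng.normal(size=(4, d))   # binaural-audio tokens
vision = rng.normal(size=(9, d))   # visual-patch tokens
ego    = rng.normal(size=(1, d))   # egomotion token
memory = rng.normal(size=(8, d))   # memory bank of past fused states

tokens = np.concatenate([audio, vision, ego], axis=0)     # (14, d)

# memory-augmented step: current tokens attend over both the
# current observation tokens and the stored memory bank, so goal
# evidence can persist even when the target falls silent
context = np.concatenate([tokens, memory], axis=0)        # (22, d)
fused = attention(tokens, context, context)               # (14, d)

# pooled goal representation for the policy head
goal_state = fused.mean(axis=0)                           # (d,)
```

In a real system the memory bank would be updated each step (e.g. by appending the fused state) rather than sampled randomly; the point here is only the attend-over-[observations; memory] structure that lets past goal cues inform the current decision.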
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Navigation
Continuous Environments
Intermittent Sound
Goal Information Loss
Semantic Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Audio-Visual Navigation
Memory-Augmented Reasoning
Multimodal Transformer
Spatially Coherent Audio Rendering
Semantic Goal Representation
Yichen Zeng
Wuhan University
Hebaixu Wang
Wuhan University
Meng Liu
Zhongguancun Academy
Yu Zhou
Nankai University
Chen Gao
Zhongguancun Academy
Kehan Chen
CASIA
Gongping Huang
Professor, Wuhan University, Wuhan, China
Acoustic Signal Processing · Microphone Arrays · Speech Enhancement · Noise Reduction