Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited cross-scenario generalization of audio-visual navigation agents, which often stems from overreliance on semantic sound features and training-environment biases. To mitigate this, the authors propose the BDATP framework, which incorporates a Binaural Difference Attention (BDA) module to explicitly model spatial auditory cues, thereby reducing dependence on sound semantics. Additionally, Action Transition Prediction (ATP) is introduced as an auxiliary task to regularize policy learning through joint optimization of perception and decision-making. The method seamlessly integrates into mainstream navigation baselines and achieves substantial performance gains in unseen environments with unheard sounds on both Replica and Matterport3D datasets. Notably, it improves absolute success rate by 21.6 percentage points on Replica, establishing a new state-of-the-art.
📝 Abstract
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle to generalize to unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
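The core intuition behind the BDA module — attending to spectrogram bins where the two ears disagree, since interaural differences carry direction rather than semantics — can be illustrated with a minimal sketch. This is not the paper's implementation; the function names and the simple softmax-over-difference weighting are assumptions for illustration only:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def binaural_difference_attention(left, right):
    """Hypothetical sketch of difference attention.

    `left` and `right` are per-bin spectrogram magnitudes from the two
    ears. Bins with large interaural level differences are spatially
    informative, so they receive higher attention weights; semantic
    content common to both ears contributes less to the weighting.
    """
    diff = [abs(l - r) for l, r in zip(left, right)]    # interaural level difference per bin
    weights = softmax(diff)                             # emphasize spatially informative bins
    fused = [(l + r) / 2 for l, r in zip(left, right)]  # simple binaural fusion
    return [w * f for w, f in zip(weights, fused)]      # attention-weighted features
```

In this toy form, a bin that is loud in one ear but quiet in the other dominates the attended feature, which is the kind of cue that survives a switch to an unheard sound category.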
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Navigation
generalization
unseen environments
overfitting
sound source localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binaural Difference Attention
Action Transition Prediction
Audio-Visual Navigation
Generalization
Spatial Orientation
Jia Li
Joint Research Laboratory for Embodied Intelligence, Xinjiang University; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University; School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
Yinfeng Yu
Associate Professor, Xinjiang University
Embodied intelligence