🤖 AI Summary
This work addresses the limited cross-scenario generalization of audio-visual navigation agents, which the authors attribute to overreliance on semantic sound features and to training-environment biases. To mitigate this, the authors propose the BDATP framework, which incorporates a Binaural Difference Attention (BDA) module that explicitly models interaural differences as spatial auditory cues, thereby reducing dependence on sound semantics. Action Transition Prediction (ATP) is added as an auxiliary task that regularizes policy learning, so that perception and policy are optimized jointly. The method integrates into several mainstream navigation baselines and yields consistent performance gains on the Replica and Matterport3D datasets. Notably, it improves absolute success rate by up to 21.6 percentage points on Replica for unheard sounds and achieves state-of-the-art success rates across most settings.
📝 Abstract
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often generalize poorly to unseen scenarios, as they tend to overfit to semantic sound features and to specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with an absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's generalization capability and its robustness across diverse navigation architectures.
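To make the two ideas in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify how BDA or ATP are built, so everything here is an assumption for illustration: the function names, the use of a decibel level difference as the interaural cue, the softmax weighting over frequency bins, the cross-entropy form of the auxiliary loss, and the `aux_weight` value are all hypothetical.

```python
import math


def interaural_level_difference(left_mag, right_mag, eps=1e-8):
    """Per-bin level difference in dB between left and right magnitudes.

    Assumption: the interaural cue is a simple log-magnitude ratio.
    """
    return [20.0 * math.log10((l + eps) / (r + eps))
            for l, r in zip(left_mag, right_mag)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def difference_attention(left_mag, right_mag):
    """Hypothetical attention: weight bins by the size of their level difference.

    Bins where the two ears disagree most receive the largest weights.
    """
    ild = interaural_level_difference(left_mag, right_mag)
    weights = softmax([abs(d) for d in ild])
    return ild, weights


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])


def joint_loss(policy_loss, action_logits, target_action, aux_weight=0.1):
    """Policy loss plus a weighted auxiliary action-prediction term.

    Assumption: the auxiliary task is a classification over actions and
    enters the objective as an additive regularizer.
    """
    return policy_loss + aux_weight * cross_entropy(action_logits, target_action)


# Toy input: three frequency bins, with only the middle bin louder in the left ear.
left = [1.0, 0.5, 0.2]
right = [1.0, 0.05, 0.2]
ild, weights = difference_attention(left, right)
total = joint_loss(policy_loss=1.0, action_logits=[0.0, 0.0], target_action=0, aux_weight=0.5)
```

In this toy input, the middle bin has a level difference of roughly 20 dB and therefore receives nearly all of the attention weight, while the two bins with identical left and right magnitudes receive almost none. The cue depends only on the difference between the two channels, not on what the sound is, which is the intuition behind reducing reliance on semantic categories.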