Spatial-Aware Conditioned Fusion for Audio-Visual Navigation

📅 2026-04-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing audio-visual navigation approaches often rely on simplistic feature concatenation or late fusion strategies, lacking explicit modeling of the target's relative spatial location, which results in low learning efficiency and poor generalization across unseen sound sources. This work proposes a novel fusion mechanism that first discretizes the target's relative direction and distance, predicts their joint distribution, and encodes it into a compact spatial descriptor. This descriptor, together with the audio embedding, generates channel-wise scaling and bias parameters to conditionally modulate visual features via affine transformations, enabling goal-oriented, efficient multimodal fusion. By introducing explicit spatial discretization and conditional feature modulation for the first time, the method achieves significantly improved navigation performance with minimal computational overhead and demonstrates strong generalization to novel sound sources.
๐Ÿ“ Abstract
Audio-visual navigation tasks require agents to locate and navigate toward continuously vocalizing targets using only visual observations and acoustic cues. However, existing methods rely mainly on simple feature concatenation or late fusion and lack an explicit discrete representation of the target's relative position, which limits learning efficiency and generalization. We propose Spatial-Aware Conditioned Fusion (SACF). SACF first discretizes the target's relative direction and distance from audio-visual cues, predicts their distributions, and encodes them as a compact descriptor for policy conditioning and state modeling. SACF then uses the audio embedding and spatial descriptor to generate channel-wise scaling and bias parameters that modulate visual features via a conditional affine transformation, producing target-oriented fused representations. SACF improves navigation efficiency with low computational overhead and generalizes well to unheard target sounds.
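The conditioning pipeline the abstract describes (discretized direction/distance distribution → compact descriptor → channel-wise scale and bias applied to visual features) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the bin counts, embedding sizes, and random weight matrices are hypothetical stand-ins for the learned prediction head and conditioning networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 8 direction bins x 4 distance
# bins, a 16-dim audio embedding, and a 32-channel 7x7 visual feature map.
N_DIR, N_DIST = 8, 4
AUDIO_DIM, VIS_C, H, W = 16, 32, 7, 7

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 1) Predict a joint distribution over the discretized direction/distance
#    bins (random logits stand in for a learned prediction head).
logits = rng.normal(size=N_DIR * N_DIST)
joint = softmax(logits)                       # sums to 1 over all bins

# 2) Encode the distribution into a compact spatial descriptor.
W_desc = 0.1 * rng.normal(size=(N_DIR * N_DIST, 8))
descriptor = joint @ W_desc                   # shape (8,)

# 3) Audio embedding + descriptor -> channel-wise scale (gamma) and
#    bias (beta), one pair of values per visual channel.
audio_emb = rng.normal(size=AUDIO_DIM)
cond = np.concatenate([audio_emb, descriptor])
W_gamma = 0.1 * rng.normal(size=(cond.size, VIS_C))
W_beta = 0.1 * rng.normal(size=(cond.size, VIS_C))
gamma = 1.0 + cond @ W_gamma                  # scale around identity
beta = cond @ W_beta

# 4) Conditional affine (FiLM-style) modulation of the visual features.
visual = rng.normal(size=(VIS_C, H, W))
fused = gamma[:, None, None] * visual + beta[:, None, None]
print(fused.shape)  # (32, 7, 7)
```

Initializing the scale around 1.0 keeps the modulation close to identity at the start, a common choice for FiLM-style conditioning layers; whether SACF does the same is not stated in this summary.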
Problem

Research questions and friction points this paper is trying to address.

audio-visual navigation
feature fusion
spatial representation
relative position
multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Aware Conditioned Fusion
audio-visual navigation
conditional feature modulation
discrete spatial representation
cross-modal fusion
Shaohang Wu
Joint Research Laboratory for Embodied Intelligence, Xinjiang University; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University; School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
Yinfeng Yu
Associate Professor, Xinjiang University
Embodied intelligence