Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual degradation (e.g., motion blur, occlusion) severely impairs camera pose estimation accuracy. To address this, we propose the first end-to-end multimodal method that leverages unlabeled, non-cooperative ambient audio to augment visual pose estimation, requiring no audio annotations or scene coordination. Our approach jointly models direction-of-arrival (DOA) spectra and binaural acoustic embeddings and integrates them into a state-of-the-art visual pose network, operating entirely on naturally recorded real-world audiovisual data. Evaluated on two large-scale real-world video datasets, the method consistently outperforms strong visual-only baselines and remains robust under severe image degradation. This work provides the first empirical evidence that passive environmental sound supplies effective complementary spatial cues for pose estimation, establishing auditory-augmented visual localization as a new paradigm.
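To make the fusion concrete, below is a minimal sketch (not the authors' released code) of one way DOA spectra and binaural audio embeddings could be injected into a visual pose regression head. The module names, feature dimensions, and the simple concatenation-plus-MLP fusion are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of audio-visual fusion for relative pose regression.
# All names and sizes are illustrative; the paper's actual network differs.
import torch
import torch.nn as nn


class AudioVisualPoseHead(nn.Module):
    def __init__(self, vis_dim=512, doa_bins=360, audio_dim=128, hidden=256):
        super().__init__()
        # Encode the DOA spectrum (energy over candidate arrival directions).
        self.doa_encoder = nn.Sequential(nn.Linear(doa_bins, hidden), nn.ReLU())
        # Project the binaural audio embedding to the same hidden size.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Fuse visual and audio features, then regress a relative pose:
        # a 3-D translation plus a quaternion (4 values) for rotation.
        self.pose_mlp = nn.Sequential(
            nn.Linear(vis_dim + 2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 7),
        )

    def forward(self, vis_feat, doa_spectrum, audio_emb):
        doa_feat = self.doa_encoder(doa_spectrum)
        aud_feat = self.audio_proj(audio_emb)
        fused = torch.cat([vis_feat, doa_feat, aud_feat], dim=-1)
        out = self.pose_mlp(fused)
        trans, quat = out[..., :3], out[..., 3:]
        # Normalize the quaternion so it encodes a valid rotation.
        quat = quat / quat.norm(dim=-1, keepdim=True).clamp(min=1e-8)
        return trans, quat


# Usage with dummy tensors for a batch of two frame pairs.
head = AudioVisualPoseHead()
t, q = head(torch.randn(2, 512), torch.randn(2, 360), torch.randn(2, 128))
print(t.shape, q.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```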

📝 Abstract
Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.
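For readers new to the task, the relative camera pose between two frames is the rigid transform that maps points from the first camera's coordinate frame into the second's. A minimal numpy illustration (independent of the paper's implementation, assuming 4x4 world-to-camera extrinsics) is:

```python
# Relative pose from two world-to-camera extrinsics: X_cam2 = T_rel @ X_cam1.
import numpy as np


def relative_pose(T1, T2):
    """Given 4x4 world-to-camera matrices T1 and T2, return T_rel = T2 @ T1^{-1}."""
    return T2 @ np.linalg.inv(T1)


# Usage: camera 1 at the world origin, camera 2 translated 1 m along world x
# (its world-to-camera translation is therefore -1 along x).
T1 = np.eye(4)
T2 = np.eye(4)
T2[0, 3] = -1.0
print(relative_pose(T1, T2))  # identity rotation, translation (-1, 0, 0)
```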
Problem

Research questions and friction points this paper is trying to address.

Estimates camera motion using audio cues when visual data is degraded.
Integrates sound direction and binaural embeddings into visual pose models.
Enhances robustness in real-world videos with everyday passive sounds.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates audio DOA spectra into visual pose estimation (a DOA-spectrum sketch follows this list)
Uses binaural embeddings to enhance camera motion tracking
Leverages passive scene sounds for robustness in degraded conditions
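The DOA spectra mentioned above are commonly derived from inter-channel time differences. Below is a minimal GCC-PHAT-based sketch for a two-channel clip; the microphone spacing, angle grid, and the use of GCC-PHAT itself are assumptions for illustration and are not claimed to match the paper's audio front end.

```python
# Hypothetical DOA spectrum from a stereo/binaural clip via GCC-PHAT.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s


def gcc_phat(x, y, fs, max_tau):
    """Generalized cross-correlation with phase transform between two channels."""
    n = x.shape[0] + y.shape[0]
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = int(fs * max_tau)
    # Re-center so the array covers lags from -max_shift to +max_shift samples.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return lags, cc


def doa_spectrum(left, right, fs, mic_dist=0.18, n_angles=180):
    """Score candidate azimuths by the GCC-PHAT value at each angle's expected delay."""
    max_tau = mic_dist / SPEED_OF_SOUND
    lags, cc = gcc_phat(left, right, fs, max_tau)
    angles = np.linspace(-90.0, 90.0, n_angles)
    taus = mic_dist * np.sin(np.deg2rad(angles)) / SPEED_OF_SOUND
    idx = np.clip(np.searchsorted(lags, taus), 0, len(cc) - 1)
    return angles, cc[idx]


# Usage: a synthetic source whose signal reaches one channel 4 samples late,
# so the spectrum should peak away from 0 degrees (source off to one side).
fs = 16000
sig = np.random.randn(fs)
left = sig
right = np.concatenate((np.zeros(4), sig[:-4]))
angles, spectrum = doa_spectrum(left, right, fs)
print("Peak azimuth (deg):", angles[int(np.argmax(spectrum))])
```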