🤖 AI Summary
This work addresses the semantic ambiguity inherent in vision-only global geolocation by proposing a multimodal framework that integrates auditory cues. The method first extracts interpretable “acoustic atoms” with a mixture-autoregressive sparse autoencoder, then fuses them with visual features through a fine-tuned multimodal large language model for geospatial reasoning. High-precision location prediction is performed on the spherical manifold (S²) via Riemannian Flow Matching, and training is further strengthened with Group Relative Policy Optimization. Key contributions include the first interpretable soundscape-perception mechanism, the creation of AVG (the first large-scale audio-visual geolocation benchmark, comprising 20,000 video clips), and measurable gains over unimodal baselines, validating the critical role of auditory signals in fine-grained global localization.
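For readers unfamiliar with Group Relative Policy Optimization, the minimal Python sketch below shows its core idea: instead of training a separate value model (critic), each sampled response's reward is normalized against the other responses in its own group. This is the generic GRPO advantage computation, not the paper's implementation; the distance-based reward framing and the example values are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against its sampled
    group (mean/std), so no learned value model (critic) is needed."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards for a group of candidate geolocation answers to one
# clip, e.g. scored by proximity to the ground-truth coordinate:
print(group_relative_advantages([0.9, 0.2, 0.5, 0.7]))
```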
📝 Abstract
While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM fine-tuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results indicate that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
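As a rough illustration of stage (3), the Python sketch below builds the conditional path used by flow matching on $S^2$: the great-circle (geodesic) interpolant between a noise point and the target coordinate, together with its time derivative, which serves as the regression target for the learned velocity field. The function names, the lat/lon example, and the uniform-noise choice are assumptions for illustration; the paper's conditioning network and training loop are omitted.

```python
import numpy as np

def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude (degrees) to a unit vector on S^2."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def geodesic_point_and_velocity(x0, x1, t):
    """Great-circle interpolant x_t between x0 (noise) and x1 (target),
    plus its time derivative u_t -- the flow-matching regression target."""
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if theta < 1e-6:                      # points (nearly) coincide
        return x1.copy(), np.zeros(3)
    s = np.sin(theta)
    x_t = (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / s
    u_t = theta * (-np.cos((1 - t) * theta) * x0 + np.cos(t * theta) * x1) / s
    return x_t, u_t

# One conditional training pair: sample noise uniformly on the sphere, pick a
# random time t, and regress a network v(x_t, t | context) onto u_t. The
# conditioning on audiovisual features is omitted in this sketch.
rng = np.random.default_rng(0)
x0 = rng.normal(size=3); x0 /= np.linalg.norm(x0)   # uniform noise on S^2
x1 = latlon_to_xyz(48.8566, 2.3522)                 # example target: Paris
t = rng.uniform()
x_t, u_t = geodesic_point_and_velocity(x0, x1, t)
```

At sampling time, one would integrate the learned velocity field from noise to $t = 1$ and re-project onto the sphere at each step; the geodesic speed here is constant ($\|u_t\| = \theta$), which is what makes the great-circle path the natural conditional path on $S^2$.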