🤖 AI Summary
This work addresses the semantic ambiguity inherent in vision-only global geolocation by proposing a multimodal framework that integrates auditory cues. The method first extracts interpretable “acoustic atoms” with a mixture-autoregressive sparse autoencoder, then fuses them with visual features through a fine-tuned multimodal large language model for geospatial reasoning. High-precision location prediction is performed on the spherical manifold (S²) via Riemannian Flow Matching, and training is further strengthened with Group Relative Policy Optimization. Key contributions include the first interpretable soundscape-perception mechanism, the creation of AVG (the first large-scale audio-visual geolocation benchmark, comprising 20,000 video clips), and measurable gains over unimodal baselines, validating the critical role of auditory signals in fine-grained global localization.
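For readers unfamiliar with Group Relative Policy Optimization, the minimal Python sketch below shows its core idea: instead of training a separate value model (critic), each sampled response's reward is normalized against the other responses in its own group. This is the generic GRPO advantage computation, not the paper's implementation; the distance-based reward framing and the example values are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against its sampled
    group (mean/std), so no learned value model (critic) is needed."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards for a group of candidate geolocation answers to one
# clip, e.g. scored by proximity to the ground-truth coordinate:
print(group_relative_advantages([0.9, 0.2, 0.5, 0.7]))
```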
📝 Abstract
While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM fine-tuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results indicate that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
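As a rough illustration of stage (3), the Python sketch below builds the conditional path used by flow matching on $S^2$: the great-circle (geodesic) interpolant between a noise point and the target coordinate, together with its time derivative, which serves as the regression target for the learned velocity field. The function names, the lat/lon example, and the uniform-noise choice are assumptions for illustration; the paper's conditioning network and training loop are omitted.

```python
import numpy as np

def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude (degrees) to a unit vector on S^2."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def geodesic_point_and_velocity(x0, x1, t):
    """Great-circle interpolant x_t between x0 (noise) and x1 (target),
    plus its time derivative u_t -- the flow-matching regression target."""
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if theta < 1e-6:                      # points (nearly) coincide
        return x1.copy(), np.zeros(3)
    s = np.sin(theta)
    x_t = (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / s
    u_t = theta * (-np.cos((1 - t) * theta) * x0 + np.cos(t * theta) * x1) / s
    return x_t, u_t

# One conditional training pair: sample noise uniformly on the sphere, pick a
# random time t, and regress a network v(x_t, t | context) onto u_t. The
# conditioning on audiovisual features is omitted in this sketch.
rng = np.random.default_rng(0)
x0 = rng.normal(size=3); x0 /= np.linalg.norm(x0)   # uniform noise on S^2
x1 = latlon_to_xyz(48.8566, 2.3522)                 # example target: Paris
t = rng.uniform()
x_t, u_t = geodesic_point_and_velocity(x0, x1, t)
```

At sampling time, one would integrate the learned velocity field from noise to $t = 1$ and re-project onto the sphere at each step; the geodesic speed here is constant ($\|u_t\| = \theta$), which is what makes the great-circle path the natural conditional path on $S^2$.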