🤖 AI Summary
This work investigates whether a recording can be geolocated from natural sounds alone (e.g., wildlife vocalizations). We formally define the audio geolocation task and introduce the first global-scale benchmark, built on the iNatSounds dataset, to systematically evaluate how much geographic information acoustic signals carry. Methodologically, we propose an approach that integrates species range prediction with retrieval-based localization, and we study the effect of aggregating over spatiotemporal neighborhoods and of weighting by species richness. We further extend this framework to multimodal audiovisual geolocation through case studies on movie scenes. Results demonstrate that bioacoustic signatures exhibit strong geographic specificity, substantially outperforming audio-only baselines in country- and state-level localization; multimodal fusion further improves accuracy. To foster research in audio-based geographic perception, we publicly release the benchmark dataset, evaluation protocols, and source code.
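To make the idea of combining species range prediction with retrieval concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the function names, the `range_prior` array, and the rule of multiplying a shifted cosine similarity by the prior are all hypothetical choices made for illustration. The sketch assumes each gallery recording already has an embedding and a known latitude and longitude.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def retrieve_location(query_emb, gallery_embs, gallery_latlon, range_prior=None):
    """Return the (lat, lon) of the best-scoring gallery recording.

    range_prior is a hypothetical per-gallery plausibility in [0, 1], e.g. how
    likely the species detected in the query are to occur at that location.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    # Cosine similarity shifted from [-1, 1] to [0, 1] so the prior scales it sensibly.
    scores = (g @ q + 1.0) / 2.0
    if range_prior is not None:
        # Assumed fusion rule: down-weight locations outside the predicted range.
        scores = scores * np.asarray(range_prior, dtype=float)
    best = int(np.argmax(scores))
    lat, lon = gallery_latlon[best]
    return float(lat), float(lon)
```

The predicted location can then be scored against the ground truth with `haversine_km`, for example by checking whether the error falls under a distance threshold chosen for country- or state-level accuracy.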
📝 Abstract
Can we determine someone's geographic location purely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? We tackle the challenge of global-scale audio geolocation, formalize the problem, and conduct an in-depth analysis with wildlife audio from the iNatSounds dataset. Adopting a vision-inspired approach, we convert audio recordings to spectrograms and benchmark existing image geolocation techniques. We hypothesize that species vocalizations offer strong geolocation cues due to their defined geographic ranges and propose an approach that integrates species range prediction with retrieval-based geolocation. We further evaluate whether geolocation improves when analyzing species-rich recordings or when aggregating across spatiotemporal neighborhoods. Finally, we introduce case studies from movies to explore multimodal geolocation using both audio and visual content. Our work highlights the advantages of integrating audio and visual cues, and sets the stage for future research in audio geolocation.
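The vision-inspired step of converting audio to spectrograms can be sketched as follows. This is an illustrative example only: the abstract does not specify the spectrogram type or its parameters, so the plain log-magnitude short-time Fourier transform and the `n_fft` and `hop` values here are assumptions.

```python
import numpy as np

def log_spectrogram(wave, n_fft=512, hop=128):
    """Log-magnitude STFT of a 1-D waveform, shaped (frequency bins, frames).

    n_fft and hop are assumed values, not settings taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping windowed frames.
    frames = np.stack([wave[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent frames.
    return 20.0 * np.log10(magnitude + 1e-10).T

# Usage: a one-second 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
spec = log_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
```

The resulting 2-D array can be treated like a single-channel image, which is what allows image geolocation models to be applied to audio.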