AI Summary
This work addresses the challenge of generating spatially coherent and geographically realistic soundscapes from satellite imagery, which suffers from semantic ambiguity due to its top-down perspective and large coverage area. We introduce a novel task of satellite-to-soundscape generation and propose Geo2Sound, a unified framework that leverages a lightweight geographic attribute classifier to model spatial structure, produces multiple semantic soundscape hypotheses, and employs a geo-acoustic embedding alignment module to select the optimal output. To facilitate research in this domain, we also release SatSound-Bench, the first large-scale benchmark of paired satellite images and audio recordings. Experiments demonstrate that our method achieves a state-of-the-art FAD score of 1.765 on SatSound-Bench, representing a 50.0% improvement over the strongest baseline; human evaluations further confirm a 26.5% gain in realism and significantly enhanced semantic and geographic alignment.
Abstract
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable source for global soundscape generation, matching these views to real acoustic environments with unique spatial structures is inherently difficult. To address this challenge, we introduce Geo2Sound, a novel task and framework for generating geographically realistic soundscapes from satellite imagery. Specifically, Geo2Sound combines structural geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment in a unified framework. A lightweight classifier summarizes overhead scenes into compact geographic attributes; multiple sound-oriented semantic hypotheses are used to generate diverse, acoustically plausible candidates; and a geo-acoustic alignment module projects geographic attributes into the acoustic embedding space and identifies the candidate most consistent with the candidate sets. Moreover, we establish SatSound-Bench, the first benchmark comprising over 20k high-quality paired satellite images, text descriptions, and real-world audio recordings, collected from the field across more than 10 countries and complemented by three public datasets. Experiments show that Geo2Sound achieves a state-of-the-art FAD of 1.765, outperforming the strongest baseline by 50.0%. Human evaluations further confirm substantial gains in both realism (26.5%) and semantic alignment, validating our high-fidelity synthesis at scale. Project page and source code: https://github.com/Blanketzzz/Geo2Sound
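The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear projection, the cosine-similarity scoring rule, and the names `project`, `cosine`, and `select_candidate` are all assumptions made for illustration, and the toy vectors stand in for real geographic attributes and audio embeddings.

```python
import math
from typing import List, Sequence


def project(attrs: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Map a geographic attribute vector into the acoustic embedding space.

    Assumption: a plain linear map stands in for the learned alignment module.
    """
    return [sum(w * a for w, a in zip(row, attrs)) for row in weights]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def select_candidate(
    attrs: Sequence[float],
    weights: Sequence[Sequence[float]],
    candidate_embeddings: Sequence[Sequence[float]],
) -> int:
    """Return the index of the candidate closest to the projected attributes.

    Assumption: "most consistent" is approximated here by highest cosine similarity.
    """
    query = project(attrs, weights)
    scores = [cosine(query, emb) for emb in candidate_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with hypothetical values.
attrs = [1.0, 0.0, 0.5]                       # e.g. scores for three geographic attributes
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # projects 3 attributes to a 2-d embedding
candidates = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]  # one embedding per generated candidate
best = select_candidate(attrs, weights, candidates)
print(best)
```

In this toy setup the projected query is `[1.0, 0.0]`, so the second candidate (index 1) scores highest. In a real system the projection would be learned and the embeddings would come from an audio encoder.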