🤖 AI Summary
This paper addresses the lack of geographic authenticity in cross-modal generation from soundscapes to geographically grounded real-world images, introducing for the first time the Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation task. Methodologically, the authors propose a geography-scene-conditioned Diffusion Transformer architecture that incorporates geographic encoding embeddings, a multi-granularity cross-modal alignment mechanism, and geographically informed semantic disentanglement of acoustic features. They construct the first large-scale paired datasets, SoundingSVI and SonicUrban, and introduce a novel multi-level geographic consistency metric, the Place Similarity Score (PSS). Experiments demonstrate that the approach significantly outperforms existing audio-to-image models in both visual fidelity and geographic plausibility. PSS evaluation confirms substantial improvements in geographic element accuracy, scene structural coherence, and human-perceptual consistency, establishing a foundational benchmark for the GeoS2L task.
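The summary names a geography-scene-conditioned Diffusion Transformer with cross-modal alignment but gives no implementation detail. Below is a minimal PyTorch sketch of one plausible realization, assuming the geographic-scene embedding is fused with audio tokens and injected into the image latents via cross-attention; the class and variable names (`GeoSceneConditionedDiTBlock`, `geo_embedding`, etc.) are hypothetical illustrations, not the paper's actual design.

```python
import torch
import torch.nn as nn

class GeoSceneConditionedDiTBlock(nn.Module):
    """One Transformer block that conditions image tokens on a fused
    audio + geographic-scene embedding via cross-attention (assumed design)."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, audio_tokens, geo_embedding):
        # Broadcast the geographic-scene embedding across all audio tokens
        # so every conditioning token carries geographic context.
        cond = audio_tokens + geo_embedding.unsqueeze(1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # self-attention over latents
        x = x + self.cross_attn(self.norm2(x), cond, cond)[0]  # geo-audio conditioning
        return x + self.mlp(self.norm3(x))

# Example: 64 latent image patches, 32 audio tokens, one geo embedding per sample.
block = GeoSceneConditionedDiTBlock()
x = torch.randn(2, 64, 512)       # noisy image latents
audio = torch.randn(2, 32, 512)   # soundscape encoder output
geo = torch.randn(2, 512)         # geographic context embedding
out = block(x, audio, geo)        # -> (2, 64, 512)
```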
📝 Abstract
We present a novel and practically significant problem, Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation, which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically informed geo-contextual evaluation framework, the Place Similarity Score (PSS), which measures consistency between input soundscapes and generated landscape images at the element, scene, and human-perception levels. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and consistency with geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences.
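The abstract specifies only that PSS scores consistency at the element, scene, and human-perception levels. As an illustration of how such a multi-level metric could be aggregated, here is a hypothetical sketch; the sub-score semantics and the equal weighting are assumptions, not the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class PSSWeights:
    element: float = 1 / 3   # equal weights are an assumption; the paper may differ
    scene: float = 1 / 3
    human: float = 1 / 3

def place_similarity_score(element_score: float,
                           scene_score: float,
                           human_score: float,
                           w: PSSWeights = PSSWeights()) -> float:
    """Aggregate three [0, 1] sub-scores into a single PSS value.

    element_score: overlap of detected geographic elements between the
                   reference and the generated image (assumed definition).
    scene_score:   agreement of predicted scene categories (assumed).
    human_score:   normalized human-perception rating (assumed).
    """
    return (w.element * element_score
            + w.scene * scene_score
            + w.human * human_score)

# Example: strong element match, moderate scene and human agreement.
print(place_similarity_score(0.85, 0.70, 0.60))  # -> 0.7166...
```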