🤖 AI Summary
This work addresses the scarcity and inconsistent quality of authentic audio-visual pairs available for training audio-to-image generative models. We propose a scalable image sonification framework that requires no ground-truth audio-visual pairs. Methodologically, we leverage the reasoning capabilities of vision-language models (VLMs) for cross-modal semantic retrieval, automatically constructing semantically aligned synthetic audio-image pairs from disjoint, high-quality image and audio corpora; these pairs then train a diffusion model that implicitly learns to exploit auditory cues such as loudness, reverberation, and semantic audio mixing to guide image generation. Our contributions are threefold: (1) a training paradigm that requires no authentic audio-visual supervision; (2) lifting the restrictions on scale, quality, and diversity imposed by data curated from in-the-wild videos; and (3) competitive performance against state-of-the-art models on standard benchmarks, along with emergent, disentangled auditory control spanning semantic mixing and interpolation, loudness calibration, and reverberation-based acoustic space modeling.
📝 Abstract
Training audio-to-image generative models requires an abundance of diverse, semantically aligned audio-visual pairs. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence inherent to them. In this work, we hypothesize that insisting on ground-truth audio-visual correspondence is not only unnecessary, but also leads to severe restrictions in the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. Instead, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins are artificially paired through a retrieval process empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we show that our model has implicitly developed several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, which guide the image generation process.
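To make the retrieval-based pairing idea concrete, below is a minimal sketch, not the authors' pipeline. It assumes each image already has a VLM-generated caption and each audio clip has a text description (e.g., tags or an audio-captioning output), and that a generic sentence encoder (here `sentence-transformers`, an assumed choice) embeds both sides so that high-similarity matches can be kept as synthetic audio-image training pairs. All function and model names are illustrative.

```python
# Hedged sketch: pair unpaired image and audio corpora via text-space retrieval.
# Assumptions (not from the paper): captions exist for both modalities, and a
# shared sentence encoder is a reasonable proxy for the VLM-driven retrieval.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed text encoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works


def pair_by_retrieval(image_captions, audio_captions, threshold=0.5):
    """Return (image_idx, audio_idx, similarity) triples above `threshold`."""
    img_emb = encoder.encode(image_captions, normalize_embeddings=True)
    aud_emb = encoder.encode(audio_captions, normalize_embeddings=True)
    sims = img_emb @ aud_emb.T            # cosine similarity matrix
    best = sims.argmax(axis=1)            # best-matching audio clip per image
    return [
        (i, int(j), float(sims[i, j]))
        for i, j in enumerate(best)
        if sims[i, j] >= threshold        # keep only confident matches
    ]


# Toy usage: strings stand in for real VLM / audio-captioning outputs.
images = ["a dog barking in a park", "waves crashing on a rocky beach"]
audios = ["ocean surf and seagulls", "dog bark, outdoor ambience", "piano melody"]
for i, j, s in pair_by_retrieval(images, audios):
    print(f"image {i} <-> audio {j} (sim={s:.2f})")
```

The resulting synthetic pairs would then serve as supervision for the audio-conditioned diffusion model described above; the similarity threshold trades off pair quality against dataset size.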