Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation

📅 2025-01-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scarcity and quality inconsistency of authentic audio-video pairs in audio-image generation. We propose the first scalable, audio-video-pair-free visual sonification framework. Methodologically, we leverage vision-language models (VLMs) for cross-modal semantic retrieval to automatically construct semantically aligned synthetic audio-image pairs from unpaired image and audio corpora; these pairs then train a diffusion model that implicitly models auditory spatial characteristics—including loudness calibration, reverberation, and semantic audio mixing. Our contributions are threefold: (1) establishing the first training paradigm that requires no authentic audio-video supervision; (2) significantly improving generalization and auditory controllability under low-data or low-quality conditions; and (3) achieving state-of-the-art performance on standard benchmarks while demonstrating strong robustness and disentangled auditory control across multiple perceptual evaluation dimensions.

Technology Category

Application Category

📝 Abstract
Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground truth audio-visual correspondence, is not only unnecessary, but also leads to severe restrictions in scale, quality, and diversity of the data, ultimately impairing its use in the modern generative models. That is, we propose a scalable image sonification framework where instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process that is empowered by reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against state-of-the-art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities like semantic mixing and interpolation, loudness calibration and acoustic space modeling through reverberation that our model has implicitly developed to guide the image generation process.
Problem

Research questions and friction points this paper is trying to address.

Audio-Image Conversion
Data Pairing
Model Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Image Synthesis
Generalization Improvement
Complex Sound Processing
🔎 Similar Papers
No similar papers found.