🤖 AI Summary
To address the challenge of audio-visual object alignment in complex scenes with multiple objects and concurrent sound sources, this paper proposes an object-aware interactive audio-visual generation framework: given a user click on any object in an image, the system synthesizes its corresponding sound. Methodologically, we design a conditional latent diffusion model grounded in object-centric learning, integrating instance-level image segmentation with a novel multimodal attention mechanism. We theoretically prove that this attention mechanism approximates object masks, thereby providing interpretable guarantees for audio-object alignment. Extensive experiments on multiple benchmarks demonstrate significant improvements over state-of-the-art baselines in quantitative audio-object alignment metrics. Qualitative results confirm fine-grained, controllable object-level sound synthesis. Our core contributions are threefold: (i) the first interactive, object-aware audio generation paradigm; (ii) a theoretically grounded connection between multimodal attention and visual segmentation; and (iii) strong empirical validation of alignment fidelity and controllability.
📝 Abstract
Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an *interactive object-aware audio generation* model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the *object* level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
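The core idea of restricting audio generation to a user-selected object can be illustrated with masked cross-attention: the generator attends over image-patch features, but attention outside the selected object's segmentation mask is suppressed before normalization. The sketch below is a minimal, hypothetical simplification (the function name, shapes, and NumPy implementation are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def masked_cross_attention(query, keys, values, object_mask):
    """Cross-attention over image patches, restricted to a segmentation mask.

    query: (d,) audio-side query vector
    keys, values: (num_patches, d), (num_patches, dv) image-patch features
    object_mask: (num_patches,) boolean mask of the user-selected object

    Hypothetical sketch: patches outside the mask receive -inf logits,
    so they get exactly zero attention weight after the softmax.
    """
    d = keys.shape[-1]
    logits = (keys @ query) / np.sqrt(d)            # (num_patches,)
    logits = np.where(object_mask, logits, -np.inf) # drop non-object patches
    weights = np.exp(logits - logits[object_mask].max())
    weights = weights / weights.sum()               # softmax over masked patches
    return weights @ values, weights

# Toy usage: 16 patches, 8-dim keys, 4-dim values, object = patches 3..6
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
mask = np.zeros(16, dtype=bool)
mask[3:7] = True
out, w = masked_cross_attention(q, K, V, mask)
```

At test time, swapping in the mask of whichever object the user clicks steers the conditioning toward that object's region, which mirrors the paper's claim that attention functionally approximates test-time segmentation masks.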