🤖 AI Summary
To address the challenge of audio-visual object alignment in complex scenes with multiple objects and concurrent sound sources, this paper proposes an object-aware interactive audio-visual generation framework: given a user click on any object in an image, the system synthesizes its corresponding sound. Methodologically, we design a conditional latent diffusion model grounded in object-centric learning, integrating instance-level image segmentation with a novel multimodal attention mechanism. We theoretically prove that this attention mechanism approximates object masks, thereby providing interpretable guarantees for audio-object alignment. Extensive experiments on multiple benchmarks demonstrate significant improvements over state-of-the-art baselines in quantitative audio-object alignment metrics. Qualitative results confirm fine-grained, controllable object-level sound synthesis. Our core contributions are threefold: (i) the first interactive, object-aware audio generation paradigm; (ii) a theoretically grounded connection between multimodal attention and visual segmentation; and (iii) strong empirical validation of alignment fidelity and controllability.
📝 Abstract
Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an *interactive object-aware audio generation* model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the *object* level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
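The core idea of restricting audio generation to a user-selected object can be illustrated with masked cross-attention: the generator attends over image-patch features, but attention outside the selected object's segmentation mask is suppressed before normalization. The sketch below is a minimal, hypothetical simplification (the function name, shapes, and NumPy implementation are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def masked_cross_attention(query, keys, values, object_mask):
    """Cross-attention over image patches, restricted to a segmentation mask.

    query: (d,) audio-side query vector
    keys, values: (num_patches, d), (num_patches, dv) image-patch features
    object_mask: (num_patches,) boolean mask of the user-selected object

    Hypothetical sketch: patches outside the mask receive -inf logits,
    so they get exactly zero attention weight after the softmax.
    """
    d = keys.shape[-1]
    logits = (keys @ query) / np.sqrt(d)            # (num_patches,)
    logits = np.where(object_mask, logits, -np.inf) # drop non-object patches
    weights = np.exp(logits - logits[object_mask].max())
    weights = weights / weights.sum()               # softmax over masked patches
    return weights @ values, weights

# Toy usage: 16 patches, 8-dim keys, 4-dim values, object = patches 3..6
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
mask = np.zeros(16, dtype=bool)
mask[3:7] = True
out, w = masked_cross_attention(q, K, V, mask)
```

At test time, swapping in the mask of whichever object the user clicks steers the conditioning toward that object's region, which mirrors the paper's claim that attention functionally approximates test-time segmentation masks.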