SoundBrush: Sound as a Brush for Visual Scene Editing

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the first end-to-end audio-driven visual scene editing framework, addressing the problem of *how to semantically edit 2D/3D images or videos conditioned on arbitrary real-world audio*. Methodologically, it maps audio features (extracted with CLAP or ASR encoders) into the text embedding space of latent diffusion models (LDMs), enabling cross-modal alignment and semantic control; edits preserve the original scene structure and content while modulating ambiance or inserting sound-emitting objects. Key contributions: (1) the first end-to-end mapping paradigm from audio to visual editing; (2) the first scalable audio-driven framework supporting consistent multi-view editing of 3D scenes; and (3) empirical validation on a newly constructed soundscape-paired dataset, demonstrating high-fidelity editing for in-the-wild audio inputs.
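The core mapping step described above can be sketched as a learned projection from an audio encoder's embedding space into the LDM's text-conditioning space. The sketch below is illustrative only: the 512-d audio dimension matches public CLAP checkpoints, but the 768-d target dimension, the single-layer projection, and the L2 normalization are assumptions, not the paper's actual architecture.

```python
import numpy as np

# Assumed dimensions: CLAP audio embeddings are 512-d in public
# checkpoints; the LDM text-conditioning space is taken as 768-d here.
AUDIO_DIM, TEXT_DIM = 512, 768

rng = np.random.default_rng(0)

# Stand-in for a learned projection (randomly initialized here) that
# maps an audio embedding into the LDM's text embedding space.
W = rng.standard_normal((AUDIO_DIM, TEXT_DIM)) / np.sqrt(AUDIO_DIM)
b = np.zeros(TEXT_DIM)

def audio_to_text_embedding(audio_emb: np.ndarray) -> np.ndarray:
    """Project a CLAP-style audio embedding into text-token space."""
    token = audio_emb @ W + b
    # L2-normalize, mirroring how CLIP-style embeddings are compared.
    return token / np.linalg.norm(token)

audio_emb = rng.standard_normal(AUDIO_DIM)  # stand-in for a CLAP feature
text_token = audio_to_text_embedding(audio_emb)
print(text_token.shape)
```

In training, such a projection would be optimized so that the resulting pseudo text token conditions the LDM's cross-attention layers the same way a caption embedding would, which is what lets arbitrary sounds drive the edit.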

📝 Abstract
We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at https://soundbrush.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Audio-driven Visual Editing
Content Integrity
Audio-Visual Synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

SoundBrush
Latent Diffusion Model
Audio-Driven Visual Editing