Audio-Guided Visual Editing with Complex Multi-Modal Prompts

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-guided visual editing methods rely on dataset-specific training for audio–text modality alignment, exhibiting poor generalization; pure text guidance, in turn, struggles to represent complex scenes. This paper proposes the first zero-shot, fine-tuning-free framework for joint audio–text-guided visual editing. It leverages a pre-trained multimodal encoder to achieve cross-modal space alignment, mapping audio embeddings into the prompt space of diffusion models. A novel architecture features a disentangled noise branch and an adaptive patch selection mechanism to decouple and fuse multimodal prompts. Our method is the first to enable collaborative, zero-shot editing driven jointly by multiple audio inputs and textual descriptions—significantly enhancing modeling capability for soundscape-related scenes (e.g., “rain sound + window”). Experiments demonstrate superior performance over text-only baselines in editing fidelity, semantic consistency, and task generalization.

Technology Category

Application Category

📝 Abstract
Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks, by alleviating the discrepancy between the audio encoder space and the diffusion model's prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.
Problem

Research questions and friction points this paper is trying to address.

Handling complex visual editing scenarios with multi-modal prompts
Integrating audio guidance without requiring additional training
Aligning audio encoder space with diffusion model's prompt space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-guided visual editing without training
Multi-modal encoder with zero-shot capabilities
Separate noise branching and adaptive selection
🔎 Similar Papers
No similar papers found.
H
Hyeonyu Kim
MAUM AI Inc., Republic of Korea
S
Seokhoon Jeong
Artificial Intelligence Graduate School, UNIST, Republic of Korea
S
Seonghee Han
Artificial Intelligence Graduate School, UNIST, Republic of Korea
C
Chanhyuk Choi
Artificial Intelligence Graduate School, UNIST, Republic of Korea
Taehwan Kim
Taehwan Kim
UNIST
Machine LearningComputer VisionLanguage Processing