Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the audio-video desynchronization problem arising from video editing. We propose a text-guided joint audio-video editing framework centered on a novel conditional video-to-audio generation model. The model jointly conditions on the source audio, target video frames, and textual prompts; a dynamic weight modulation network adaptively fuses structural cues from the source audio with semantic guidance from the text. Additionally, we introduce a conditional audio encoder architecture and targeted data augmentation strategies to ensure precise audio alignment with visually edited content. Compared to existing methods, our approach achieves significant improvements in three key dimensions: temporal audio-video alignment, semantic content consistency, and auditory naturalness. Extensive evaluations across multiple benchmark datasets demonstrate state-of-the-art performance.

📝 Abstract
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
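The abstract says the model adjusts the influence of the source audio according to the complexity of the edits, but does not say how. The following is a minimal sketch of one way such a gate could work, not the authors' implementation. Every name here (`edit_complexity`, `source_audio_weight`, `fuse_conditions`), the mean-absolute-difference complexity measure, the sigmoid mapping, and its `sharpness` and `midpoint` parameters are assumptions made for illustration. Frames are assumed to be arrays with values in [0, 1], and the three embeddings are stand-ins for whatever encoders the real model uses.

```python
import numpy as np


def edit_complexity(source_frames, target_frames):
    """Hypothetical complexity score: mean absolute pixel change in [0, 1].

    Assumes both inputs hold frames with values in [0, 1] and equal shapes.
    """
    src = np.asarray(source_frames, dtype=np.float64)
    tgt = np.asarray(target_frames, dtype=np.float64)
    if src.shape != tgt.shape:
        raise ValueError("source and target frames must have the same shape")
    return float(np.mean(np.abs(src - tgt)))


def source_audio_weight(complexity, sharpness=10.0, midpoint=0.25):
    """Map complexity to a weight in (0, 1) with a decreasing sigmoid.

    Small edits give a weight near 1 (keep the source audio structure);
    large edits give a weight near 0 (rely on video and text instead).
    The sharpness and midpoint values are arbitrary illustrative choices.
    """
    return float(1.0 / (1.0 + np.exp(sharpness * (complexity - midpoint))))


def fuse_conditions(audio_emb, video_emb, text_emb, weight):
    """Scale the source-audio embedding by the weight, then concatenate
    it with the target-video and text embeddings into one condition vector."""
    return np.concatenate([weight * audio_emb, video_emb, text_emb])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.random((8, 16, 16, 3))            # 8 small RGB frames
    light_edit = np.clip(source + 0.02, 0.0, 1.0)  # subtle change
    heavy_edit = rng.random((8, 16, 16, 3))        # unrelated content

    for name, target in [("light", light_edit), ("heavy", heavy_edit)]:
        c = edit_complexity(source, target)
        w = source_audio_weight(c)
        cond = fuse_conditions(np.ones(4), np.ones(4), np.ones(4), w)
        print(name, round(c, 3), round(w, 3), cond.shape)
```

In this sketch a light edit keeps the source-audio weight high and a heavy edit pushes it toward zero, which mirrors the stated goal of preserving the original audio structure where possible. A learned gate would replace the fixed sigmoid in practice.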
Problem

Research questions and friction points this paper is trying to address.

Video editing leaves the original audio out of sync with the edited visuals
Audio must be edited to align with the visual changes after video editing
Audio-visual alignment and content integrity must both be maintained in the result
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-to-audio generation model with conditional audio input
Dynamic adjustment of source audio influence based on edit complexity
Data augmentation strategy for improved training efficiency