Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses cross-modal synchronization and semantic inconsistency in zero-shot joint audio-video editing. We propose AvED, a text-driven audio-video co-editing framework that requires no fine-tuning or additional training. Methodologically, we introduce a cross-modal delta denoising mechanism that models temporal and semantic interactions between audio and video features, augmented by an audio-video feature alignment constraint to enforce consistency across modalities. For systematic evaluation, we construct AvED-Bench, a benchmark dedicated to zero-shot audio-video editing, and additionally evaluate on the recent OAVE dataset. Experiments show that our method outperforms existing zero-shot unimodal editing approaches in temporal coherence and cross-modal fidelity, producing synchronized, text-controllable audio-video edits.

📝 Abstract
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each 10 seconds long, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED achieves superior results on both AvED-Bench and the recent OAVE dataset, which validates its generalization capability. Results are available at https://genjib.github.io/project_page/AVED/index.html
Problem

Research questions and friction points this paper is trying to address.

Zero-shot audio-video editing without model training
Synchronization and coherence issues in cross-modal editing
Editing audio-visual content so that both modalities align with a text prompt
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot cross-modal delta denoising framework
Leverages audio-video interactions for synchronization
No additional model training required
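
The summary names delta denoising without describing it. As a rough illustration only, the sketch below shows a generic single-modality delta-denoising update in the style of Delta Denoising Score. It is not the AvED method: the cross-modal coupling between audio and video is omitted because this page does not specify it. The names toy_noise_predictor, delta_denoising_step, and edit_latent are hypothetical, and the toy predictor is an assumed stand-in for a frozen pretrained diffusion denoiser.

```python
import random


def toy_noise_predictor(noisy_latent, prompt_embedding):
    """Hypothetical stand-in for a frozen, pretrained diffusion denoiser.

    A real denoiser is a neural network; this toy version simply returns
    the offset of the noisy latent from the prompt embedding.
    """
    return [z - p for z, p in zip(noisy_latent, prompt_embedding)]


def delta_denoising_step(z_edit, z_src, src_emb, tgt_emb, noise, lr):
    """One generic delta-denoising update (assumed form, not AvED itself).

    The same noise is added to the edited and source latents. The update
    direction is the difference between the noise predicted for the edited
    latent under the target prompt and the noise predicted for the source
    latent under the source prompt.
    """
    noisy_edit = [z + n for z, n in zip(z_edit, noise)]
    noisy_src = [z + n for z, n in zip(z_src, noise)]
    eps_tgt = toy_noise_predictor(noisy_edit, tgt_emb)
    eps_src = toy_noise_predictor(noisy_src, src_emb)
    return [z - lr * (et - es) for z, et, es in zip(z_edit, eps_tgt, eps_src)]


def edit_latent(z_src, src_emb, tgt_emb, steps=200, lr=0.1, seed=0):
    """Optimize a copy of the source latent; no model weights are updated."""
    rng = random.Random(seed)
    z_edit = list(z_src)
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in z_src]
        z_edit = delta_denoising_step(z_edit, z_src, src_emb, tgt_emb, noise, lr)
    return z_edit


if __name__ == "__main__":
    z_src = [0.5, -1.0, 2.0]
    src_emb = [1.0, 0.0, 0.0]  # hypothetical source-prompt embedding
    tgt_emb = [0.0, 1.0, 0.0]  # hypothetical target-prompt embedding
    print(edit_latent(z_src, src_emb, tgt_emb))
```

With this toy predictor the shared noise cancels in the difference, so the edited latent drifts from the source by roughly the gap between the target and source embeddings. The point of the sketch is only that the edit comes from optimizing the latent against a frozen denoiser, which is consistent with the "no additional model training" claim above.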