Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

📅 2025-03-26

🤖 AI Summary

This work addresses cross-modal synchronization and semantic inconsistency in zero-shot joint audio-video editing. It proposes a text-driven audio-video co-editing framework that requires no fine-tuning, presented as the first of its kind. The method introduces a cross-modal delta denoising mechanism that explicitly models temporal and semantic interactions between audio and video features, augmented by an audio-video feature alignment constraint that enforces consistency between the two modalities. For systematic evaluation, the authors construct AvED-Bench, a benchmark designed specifically for zero-shot audio-video editing, and additionally evaluate on the recent OAVE dataset. Experiments show that the method outperforms existing zero-shot unimodal editing approaches in temporal coherence and cross-modal fidelity, producing synchronized, text-controllable audio-video edits.

📝 Abstract

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each 10 seconds long, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED achieves superior results on AvED-Bench and on the recent OAVE dataset, with the latter validating its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html.

Problem

Research questions and friction points this paper is trying to address.

Zero-shot audio-video editing without model training

Synchronization and coherence issues in cross-modal editing

Creating aligned auditory-visual content from text prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot cross-modal delta denoising framework

Leverages audio-video interactions for synchronization

No additional model training required
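
The summary above does not spell out the update rule behind "delta denoising". As a rough illustration of what the term generally refers to (in the style of Delta Denoising Score for single-modality editing), the sketch below computes the difference between two noise predictions that share the same noise sample: one for the edited latent under the target prompt, one for the source latent under the source prompt. This is not the AvED cross-modal method. The names `toy_noise_predictor`, `add_noise`, and `delta_denoising_step` are hypothetical, and the toy predictor stands in for a pretrained diffusion denoiser.

```python
import random


def toy_noise_predictor(z_t, cond):
    """Hypothetical stand-in for a pretrained, frozen diffusion denoiser.

    A real model would be a neural network conditioned on a text prompt;
    here the "prompt" is just a vector and the prediction is z_t - cond.
    """
    return [z - c for z, c in zip(z_t, cond)]


def add_noise(z, noise, alpha):
    """Noise a latent: alpha * z + sqrt(1 - alpha^2) * noise."""
    sigma = (1.0 - alpha ** 2) ** 0.5
    return [alpha * v + sigma * n for v, n in zip(z, noise)]


def delta_denoising_step(z_edit, z_src, cond_tgt, cond_src, alpha, lr, rng):
    """One zero-shot editing step driven by a difference of noise predictions.

    Both branches are noised with the SAME noise sample, so the component of
    the prediction that is unrelated to the prompt change cancels in the delta.
    Only the latent is updated; the predictor itself is never trained.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in z_edit]
    eps_tgt = toy_noise_predictor(add_noise(z_edit, noise, alpha), cond_tgt)
    eps_src = toy_noise_predictor(add_noise(z_src, noise, alpha), cond_src)
    delta = [a - b for a, b in zip(eps_tgt, eps_src)]
    z_new = [z - lr * d for z, d in zip(z_edit, delta)]
    return z_new, delta


if __name__ == "__main__":
    rng = random.Random(0)
    source_latent = [0.5, -1.0, 2.0]
    edited = list(source_latent)
    for _ in range(20):
        edited, _ = delta_denoising_step(
            edited, source_latent,
            cond_tgt=[2.0, 1.0, 1.0], cond_src=[1.0, 1.0, 1.0],
            alpha=0.8, lr=0.1, rng=rng,
        )
    print(edited)
```

In this toy setting the delta is zero whenever the prompts match and the edited latent equals the source, so content unaffected by the prompt change is left alone. A cross-modal variant would presumably couple an audio branch and a video branch, but how AvED does so is not described here.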
