InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video editing methods typically ignore the accompanying audio, which leaves the audio inconsistent with the edited video. To address this, the authors propose InstructAV2AV, which they describe as the first end-to-end framework for instruction-guided joint audio-video editing. They construct InsAVE-80K, a large-scale dataset of high-quality source-to-target pairs, and adapt a diffusion-based audio-video generation backbone in three ways: concatenating the source audio-video input with the noisy latent codes, adding a source-instruction gated attention mechanism to improve instruction following and content preservation, and training in two stages to transfer the pre-trained priors. The authors report that the method outperforms state-of-the-art approaches across 11 metrics spanning three aspects on two evaluation sets.

📝 Abstract

Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: https://hjzheng.net/projects/InstructAV2AV/.
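
The abstract names two conditioning ideas without giving their details: concatenating the source input with the noisy latents, and a gated attention over source and instruction. The NumPy sketch below is only a toy illustration of what such components could look like, not the authors' implementation. The concatenation axis, the sigmoid gate, the function `gated_source_instruction_attention`, the weight `w_gate`, and all tensor shapes are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: q is (T, d), k and v are (S, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_source_instruction_attention(x, src, instr, w_gate):
    """Hypothetical gate that blends attention over source tokens with
    attention over instruction tokens (an assumed design, not the paper's)."""
    from_src = attend(x, src, src)        # what to preserve from the source
    from_instr = attend(x, instr, instr)  # what the instruction asks for
    gate = sigmoid(x @ w_gate)            # (T, 1), one gate value per token
    return gate * from_instr + (1.0 - gate) * from_src

rng = np.random.default_rng(0)
T, d = 6, 8                               # toy sizes: 6 latent tokens, width 8
noisy = rng.normal(size=(T, d))           # stand-in for noisy target latents
source = rng.normal(size=(T, d))          # stand-in for encoded source audio-video
instr = rng.normal(size=(4, d))           # stand-in for instruction embeddings

# Assumption: concatenate along the feature axis so every noisy token
# carries its aligned source token as context.
model_input = np.concatenate([noisy, source], axis=-1)   # shape (T, 2 * d)

w_gate = rng.normal(size=(d, 1))          # would be learned; random here
out = gated_source_instruction_attention(noisy, source, instr, w_gate)
```

In this toy version, a gate value near 1 lets a token follow the instruction and a value near 0 keeps it close to the source, which is one plausible way to trade off instruction following against content preservation.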

Problem

Research questions and friction points this paper is trying to address.

audio-video editing

instruction-guided generation

multimodal content manipulation

video editing

audio-visual consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-guided editing

audio-video joint editing

diffusion models