VAInpaint: Zero-Shot Video-Audio inpainting framework with LLMs-driven Module

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first zero-shot audio-video joint inpainting framework, designed to remove a specified object from a video, along with its corresponding audio, while leaving the rest of the scene intact. Methodologically, it integrates a segmentation model to generate spatial masks for object-aware video inpainting, and leverages a large language model (LLM) to capture both global and local scene semantics, enabling text-query-driven audio source separation and thus aligning the visual and audio edits cross-modally. The pipeline runs in stages (segmentation, video inpainting, LLM-based semantic bridging, text-guided audio separation), with the audio separation model fine-tuned on a custom dataset to improve generalization. Experiments show the method matches or surpasses state-of-the-art baselines in video/audio reconstruction quality, multimodal editing consistency, and zero-shot generalization, advancing fine-grained, semantically controllable audio-video joint editing.
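The staged pipeline described above can be sketched as plain Python orchestration. This is a minimal illustrative sketch, not the paper's actual code: every function name, signature, and return value below is a hypothetical placeholder for a real model (segmentation, video inpainting, captioning, LLM query refinement, text-driven audio separation).

```python
# Hypothetical sketch of the VAInpaint multi-stage pipeline.
# All names and outputs are illustrative stubs, not the paper's API.

def segment_object(video_frames, object_name):
    # Stage 1: a segmentation model would return per-frame binary masks
    # for the target object. Stubbed as symbolic strings here.
    return [f"mask({object_name}, frame={i})" for i in range(len(video_frames))]

def inpaint_video(video_frames, masks):
    # Stage 2: a video inpainting model fills the masked regions.
    return [f"inpainted({f})" for f in video_frames]

def describe_scene(video_frames, masks):
    # Stage 3a: a global scene description plus a region-specific
    # description of the masked object (both model-generated in practice).
    global_desc = "a street scene with a person playing guitar"
    local_desc = "acoustic guitar in the masked region"
    return global_desc, local_desc

def build_text_query(global_desc, local_desc):
    # Stage 3b: an LLM fuses global and local descriptions into a text
    # query for the audio separation model. Stubbed as concatenation.
    return f"remove: {local_desc} | context: {global_desc}"

def separate_audio(audio, text_query):
    # Stage 4: text-guided source separation removes the queried sound.
    return f"audio_without({text_query.split('|')[0].strip()})"

def vainpaint(video_frames, audio, object_name):
    masks = segment_object(video_frames, object_name)
    clean_video = inpaint_video(video_frames, masks)
    global_desc, local_desc = describe_scene(video_frames, masks)
    query = build_text_query(global_desc, local_desc)
    clean_audio = separate_audio(audio, query)
    return clean_video, clean_audio

clean_video, clean_audio = vainpaint(["f0", "f1"], "mix.wav", "guitar")
```

The key design point this sketch highlights is that the audio branch never sees pixels directly: the LLM-produced text query is the sole bridge between the visual mask and the audio separation model.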

📝 Abstract
Video and audio inpainting for mixed audio-visual content has recently become a crucial task in multimedia editing. However, precisely removing an object and its corresponding audio from a video without affecting the rest of the scene remains a significant challenge. To address this, we propose VAInpaint, a novel pipeline that first utilizes a segmentation model to generate masks and guide a video inpainting model in removing objects. In parallel, an LLM analyzes the scene globally, while a region-specific model provides localized descriptions. Both the overall and regional descriptions are then fed to an LLM, which refines them into text queries for our text-driven audio separation model. Our audio separation model is fine-tuned on a customized dataset comprising segmented MUSIC instrument images and VGGSound backgrounds to enhance its generalization performance. Experiments show that our method achieves performance comparable to current benchmarks in both audio and video inpainting.
Problem

Research questions and friction points this paper is trying to address.

Precisely removing an object and its corresponding audio from a video
Jointly inpainting mixed audio-visual content
Maintaining scene integrity while removing specific visual-audio elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmentation model generates masks for video inpainting
LLM analyzes scene globally and refines text queries
Audio separation model fine-tuned on a customized dataset of segmented MUSIC instrument images and VGGSound backgrounds
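The customized fine-tuning dataset pairs segmented MUSIC instrument images with VGGSound backgrounds. A plausible way to composite one such training sample is sketched below; the field names, the random placement, and the symbolic audio mix are all assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch: composite a segmented instrument (foreground) onto a
# VGGSound background, with the audio mixture as input and the individual
# tracks as separation targets. All structure here is hypothetical.
import random

def composite_sample(instrument, background, rng=None):
    rng = rng or random.Random()
    # Visual side: paste the segmented instrument image at a random
    # position on the background frame (represented symbolically).
    pos = (rng.randint(0, 100), rng.randint(0, 100))
    frame = {"bg": background["frame"], "fg": instrument["image"], "pos": pos}
    # Audio side: the model input is the mix of both sources; the targets
    # are the isolated foreground and background tracks.
    mixture = (instrument["audio"], background["audio"])  # symbolic mix
    return {
        "frame": frame,
        "mixture": mixture,
        "target_fg": instrument["audio"],
        "target_bg": background["audio"],
    }

sample = composite_sample(
    {"image": "violin.png", "audio": "violin.wav"},
    {"frame": "street.jpg", "audio": "street.wav"},
)
```

Compositing pairs this way gives ground-truth separated tracks for free, since the mixture is constructed rather than recorded, which is what makes supervised fine-tuning of the separation model possible.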