LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video editing datasets are constrained by high annotation costs, which limit their scale, quality, and task coverage. This work proposes LIVE, a joint training framework that uses large-scale, high-quality image editing data alongside video datasets for instruction-based video editing. A frame-wise token noise strategy, built on large pretrained video generative models, narrows the domain gap between static images and dynamic videos, and a two-stage training strategy draws on cleaned public datasets and an automated data pipeline. On a new benchmark covering over 60 challenging tasks, the authors report state-of-the-art performance in comparative and ablation experiments.

📝 Abstract

Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.
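The abstract does not spell out how the frame-wise token noise strategy is implemented, so the following is only an illustrative sketch of the general idea of perturbing the latents of selected frames while leaving the others clean. It is not the authors' method: the function name `add_framewise_noise`, the flat-list latent layout, the choice of Gaussian noise, and the `sigma` and `seed` parameters are all assumptions made for this example.

```python
import random


def add_framewise_noise(latents, noisy_frames, sigma=1.0, seed=0):
    """Add Gaussian noise only to the frames listed in `noisy_frames`.

    latents: list of frames, each a flat list of floats (hypothetical layout).
    noisy_frames: indices of frames to perturb; all other frames stay clean.
    sigma: noise scale (an assumed knob, not taken from the paper).
    """
    rng = random.Random(seed)
    noisy = set(noisy_frames)
    out = []
    for t, frame in enumerate(latents):
        if t in noisy:
            # Perturb every latent value of this frame independently.
            out.append([x + sigma * rng.gauss(0.0, 1.0) for x in frame])
        else:
            # Copy the frame unchanged so the input is never mutated.
            out.append(list(frame))
    return out


# Toy clip: 4 frames, 3 latent values per frame.
clip = [[float(t)] * 3 for t in range(4)]
noised = add_framewise_noise(clip, noisy_frames=[1, 2])
```

In this toy setup, frames 0 and 3 stay clean while frames 1 and 2 are noised. A real system would operate on tensor latents from a video autoencoder and would presumably tie the noise level to the diffusion schedule, which this sketch does not attempt.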

Problem

Research questions and friction points this paper is trying to address.

video editing

annotation cost

data scarcity

task diversity

dataset limitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

image-to-video editing

joint training framework

frame-wise token noise