Region-Constraint In-Context Generation for Instructional Video Editing

πŸ“… 2025-12-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address inaccurate region localization, inter-frame drift, and token interference in instruction-driven video editing, this paper proposes a region-constrained in-context generation paradigm. Methodologically, it introduces the first video-level region-constraint modeling framework; designs a dual regularization mechanism that combines latent-space difference regularization with attention-mask suppression; and incorporates width-wise source-target video concatenation with one-step backward denoising optimization. Contributions include: (1) ReCo-Data, the first large-scale, high-quality dataset of 500K instruction-video pairs; and (2) state-of-the-art performance across four benchmark tasks, with significant improvements in editing-region accuracy and inter-frame consistency, together with effective suppression of irrelevant-region generation and cross-frame interference.
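The width-wise concatenation step can be pictured with a toy sketch (hypothetical names; each frame is shown as a nested list of pixel rows rather than the latent tensors the method actually operates on): corresponding rows of the source and target frames are joined side by side so both videos are denoised as one sequence.

```python
def concat_widthwise(src_video, tgt_video):
    """Place source and target videos side by side, frame by frame.

    Each video is a list of frames; each frame is a list of pixel rows.
    Concatenating along the width means joining corresponding rows.
    """
    joined = []
    for src_frame, tgt_frame in zip(src_video, tgt_video):
        joined.append([s_row + t_row
                       for s_row, t_row in zip(src_frame, tgt_frame)])
    return joined

# One 2x2 frame per video; the joined frame is 2x4.
src = [[[1, 2], [3, 4]]]
tgt = [[[5, 6], [7, 8]]]
print(concat_widthwise(src, tgt))  # [[[1, 2, 5, 6], [3, 4, 7, 8]]]
```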

πŸ“ Abstract
The in-context generation paradigm has recently demonstrated strong power in instructional image editing, with both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo width-wise concatenates the source and target videos for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, conducted on one-step backward-denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing the modification of the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to their counterpart tokens in the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
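The latent regularization term can be pictured with a toy one-dimensional version (an illustrative sketch with assumed names and a hinge-style margin, not the paper's exact loss): discrepancy between source and target latents inside the editing mask is rewarded up to a margin, while discrepancy outside it is penalized.

```python
def latent_regularization(src_latents, tgt_latents, edit_mask, margin=1.0):
    """Toy per-token loss over flattened latents.

    edit_mask[i] == 1 marks a token inside the editing region.
    Editing tokens are pushed apart (hinge up to `margin`);
    non-editing tokens are pulled together (squared difference).
    """
    edit_terms, keep_terms = [], []
    for s, t, m in zip(src_latents, tgt_latents, edit_mask):
        d = (s - t) ** 2
        if m:
            edit_terms.append(max(0.0, margin - d))
        else:
            keep_terms.append(d)
    edit_loss = sum(edit_terms) / max(len(edit_terms), 1)
    keep_loss = sum(keep_terms) / max(len(keep_terms), 1)
    return edit_loss + keep_loss

# A large editing-region change plus untouched non-editing latents: zero loss.
print(latent_regularization([0.0, 0.5], [2.0, 0.5], [1, 0]))  # 0.0
```

Minimizing this toy loss has the same qualitative effect the abstract describes: edits concentrate inside the masked region, and drift outside it is suppressed.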
Problem

Research questions and friction points this paper is trying to address.

Addresses inaccurate editing regions in instructional video editing
Mitigates token interference between editing and non-editing areas
Enhances constraint modeling for precise video content generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-constraint in-context generation for video editing
Latent and attention regularization for denoising calibration
Large-scale dataset with 500K instruction-video pairs
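The attention-regularization idea above, keeping editing-region target tokens from attending to their source counterparts in the concatenated sequence, can be sketched as follows (hypothetical names; a real implementation would act on pre-softmax attention scores inside the video diffusion model):

```python
def mask_source_attention(scores, edit_mask, src_offset):
    """Suppress attention from editing-region target tokens to their
    source-video counterparts.

    scores[i][j] is the raw (pre-softmax) attention of target token i to
    token j of the concatenated sequence; source tokens start at src_offset.
    Setting a score to -inf zeroes its weight after softmax.
    """
    for i, in_edit_region in enumerate(edit_mask):
        if in_edit_region:
            scores[i][src_offset + i] = float("-inf")
    return scores

# Token 1 lies in the editing region; its link to source token 1 is cut.
scores = [[0.5, 0.5, 0.5, 0.5],
          [0.1, 0.2, 0.3, 0.4]]
print(mask_source_attention(scores, [0, 1], src_offset=2))
```

Non-editing tokens keep their source attention intact, so unchanged content can still be copied over from the source video.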
Authors

Zhongwei Zhang, University of Science and Technology of China
Fuchen Long, University of Science and Technology of China
Wei Li, University of Science and Technology of China
Zhaofan Qiu, AI Research, JD.COM
Wu Liu, University of Science and Technology of China
Ting Yao, HiDream.ai Inc.
Tao Mei, HiDream.ai Inc.; Fellow of CAE/IEEE/IAPR/CAAI