VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

๐Ÿ“… 2025-01-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
To address the challenge of simultaneously preserving high-fidelity appearance and achieving precise motion control in video object insertion, this paper proposes VideoAnydoor, a zero-shot video object insertion framework. Methodologically, it builds on a text-to-video diffusion model, using an ID extractor to inject the object's global identity, a bounding-box sequence to control overall motion, and a pixel warper that warps reference key-point features along user-specified trajectories and fuses them with the diffusion U-Net; training combines videos and static images under a reweighted reconstruction loss. Requiring only a single reference image, sparse key-points, and motion trajectories, the framework tackles the appearance-motion co-modeling bottleneck and generalizes across scenes without task-specific fine-tuning. Experiments demonstrate significant improvements over state-of-the-art methods on multiple quantitative metrics, with support for fine-grained motion editing, concurrent multi-region insertion, and downstream applications such as video virtual try-on and talking-head generation, all in a zero-shot manner.
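As a rough illustration of the reweighted reconstruction loss mentioned above, the hedged sketch below simply up-weights the reconstruction error inside the insertion boxes. The function name, the box-mask input, and the `inside_weight` factor are assumptions made for illustration; the paper's exact weighting scheme is not specified in this summary.

```python
# Hedged sketch of a region-reweighted reconstruction loss (illustrative only).
# Assumption: the loss up-weights errors inside the insertion boxes; the exact
# scheme used by VideoAnydoor is not detailed here, and all names are hypothetical.
import torch

def reweighted_reconstruction_loss(pred: torch.Tensor,
                                    target: torch.Tensor,
                                    box_mask: torch.Tensor,
                                    inside_weight: float = 2.0) -> torch.Tensor:
    """pred / target: (B, T, C, H, W) predicted vs. ground-truth frames or latents.
    box_mask: (B, T, 1, H, W) binary mask marking the insertion region per frame."""
    weight = 1.0 + (inside_weight - 1.0) * box_mask  # weight > 1 inside the boxes, 1 elsewhere
    return (weight * (pred - target) ** 2).mean()
```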

๐Ÿ“ Abstract
Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance and meanwhile support fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweight reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
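To make the pixel-warper description above concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: reference features are sampled at the user-provided key-points, tagged with their per-frame trajectory positions through a learned projection, and fused into the per-frame diffusion U-Net tokens via cross-attention. The module name, feature shapes, and the attention-based fusion are all assumptions for illustration.

```python
# Minimal sketch of the pixel-warper idea, assuming cross-attention fusion into
# the diffusion U-Net. NOT the authors' implementation; names and shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelWarperSketch(nn.Module):
    def __init__(self, feat_dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.point_proj = nn.Linear(feat_dim, feat_dim)  # embed sampled key-point features
        self.traj_proj = nn.Linear(2, feat_dim)          # embed per-frame key-point positions
        self.fuse = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def sample_keypoint_feats(self, ref_feats, keypoints):
        # ref_feats: (B, C, H, W) features of the reference image
        # keypoints: (B, N, 2) normalized (x, y) coordinates in [-1, 1]
        grid = keypoints.unsqueeze(2)                                 # (B, N, 1, 2)
        sampled = F.grid_sample(ref_feats, grid, align_corners=True)  # (B, C, N, 1)
        return sampled.squeeze(-1).transpose(1, 2)                    # (B, N, C)

    def forward(self, unet_feats, ref_feats, keypoints, trajectories):
        # unet_feats:   (B, T, L, C) flattened U-Net tokens for each of T frames
        # trajectories: (B, T, N, 2) target key-point positions per frame
        point_feats = self.point_proj(self.sample_keypoint_feats(ref_feats, keypoints))
        fused = []
        for t in range(unet_feats.shape[1]):
            # "Warp": tag each reference key-point feature with its position in frame t.
            keys = point_feats + self.traj_proj(trajectories[:, t])  # (B, N, C)
            out, _ = self.fuse(unet_feats[:, t], keys, keys)         # inject warped details
            fused.append(unet_feats[:, t] + out)                     # residual fusion
        return torch.stack(fused, dim=1)                             # (B, T, L, C)
```

In the actual system this kind of fusion would sit inside the trained video diffusion backbone and be applied at matching feature resolutions, rather than as a standalone module.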
Problem

Research questions and friction points this paper is trying to address.

High-fidelity object insertion
Precise motion control
Video editing limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoAnydoor
Unsupervised Object Insertion
Adaptive Shape Adjustment
Yuanpeng Tu
The University of Hong Kong; DAMO Academy, Alibaba Group
Hao Luo
DAMO Academy, Alibaba Group; Hupan Lab
Xi Chen
The University of Hong Kong
Sihui Ji
The University of Hong Kong
AIGC, Computer Vision
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR
Hengshuang Zhao
The University of Hong Kong
Computer Vision, Machine Learning, Artificial Intelligence