SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video

πŸ“… 2025-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing video-based body-swapping methods decompose the task into multiple sequential subproblems, leading to suboptimal end-to-end optimization and resulting in inter-frame brightness inconsistency, erroneous occlusion handling, and unnatural subject-background separation. This work formally defines body swapping as a standalone task and unifies it as a reference-fidelity-preserving and motion-controllable video inpainting problem, introducing three consistency constraints: identity, motion, and environment. We propose EnvHarmony, a progressive training strategy to enhance illumination and background harmonization. Furthermore, we construct and publicly release HumanAction-32K, a large-scale, diverse video dataset. Our approach leverages an end-to-end diffusion model integrating temporal video modeling, reference-guided feature fusion, and action-driven motion control. It achieves state-of-the-art performance among open-source solutions, with quantitative metrics competitive with proprietary models. All code, pretrained models, and the HumanAction-32K dataset are fully open-sourced.

πŸ“ Abstract
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, and has garnered increasing attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks rather than an independent task, typically relying on a sequence of separate models. Because these pipelines cannot be optimized end-to-end for video body-swapping, they suffer from issues such as luminance variations across frames, disorganized occlusion relationships, and noticeable separation between bodies and background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, which treats video body-swapping as a video inpainting task with reference fidelity and motion control. To improve environmental harmony, particularly luminance harmony, in the resulting video, we introduce a novel EnvHarmony strategy for training our model progressively. Additionally, we provide a dataset named HumanAction-32K covering diverse human-action videos. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at https://github.com/PKU-YuanGroup/SwapAnyone.
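The abstract frames body swapping as video inpainting conditioned on a reference body and driving motion. The toy sketch below (not the authors' code; all shapes and names are illustrative assumptions) shows one common way such conditioning is laid out for a diffusion inpainting model: the body region is erased from the video latents, and the mask plus encoded motion maps are stacked along the channel axis.

```python
import numpy as np

def build_inpainting_input(video_latents, body_mask, pose_maps):
    """Hypothetical conditioning layout for reference-guided,
    motion-controlled video inpainting.

    video_latents: (T, C, H, W) latents of the source video
    body_mask:     (T, 1, H, W) 1 where the original body is removed
    pose_maps:     (T, P, H, W) encoded driving motion (e.g. skeleton maps)
    """
    masked = video_latents * (1.0 - body_mask)  # erase the original body
    # Stack masked video, mask, and motion channels for the denoiser input;
    # identity features from the reference image would be injected
    # separately (e.g. via cross-attention), which is omitted here.
    return np.concatenate([masked, body_mask, pose_maps], axis=1)

T, C, H, W, P = 8, 4, 32, 32, 3
x = build_inpainting_input(
    np.random.randn(T, C, H, W),
    np.random.randint(0, 2, (T, 1, H, W)).astype(np.float32),
    np.random.randn(T, P, H, W),
)
print(x.shape)  # (8, 8, 32, 32): C + 1 + P conditioning channels
```

Under this framing, identity consistency comes from the reference features, motion consistency from the pose channels, and environment consistency from generating only inside the masked region.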
Problem

Research questions and friction points this paper is trying to address.

Multi-stage pipelines prevent end-to-end optimization of video body-swapping.
Swapped results break identity, motion, and environment consistency.
Luminance varies across frames, visibly separating the body from the background.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end video body-swapping model
EnvHarmony strategy for luminance harmony
HumanAction-32K dataset for diverse training