🤖 AI Summary
This work addresses the identity drift, garment distortion, and front-back inconsistency that commonly arise when virtual try-on and human animation are handled as separate stages. The authors propose Vanast, a unified end-to-end framework that directly generates garment-transferred human animation videos from a single person image, one or more garment images, and a pose guidance video. To supervise this setting, they construct large-scale triplet data, including identity-preserving images of the person in alternative outfits, full upper- and lower-garment triplets, and in-the-wild triplets that need no garment catalog images. A Dual Module architecture for video diffusion transformers stabilizes training, preserves pretrained generative quality, and improves garment accuracy, pose adherence, and identity preservation, while also supporting zero-shot garment interpolation. The authors report improvements over existing methods in garment fit, pose alignment, and visual quality across diverse clothing types.
📝 Abstract
We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline has three components: (1) generating identity-preserving human images in alternative outfits that differ from the garment catalog images, (2) capturing full upper- and lower-garment triplets to overcome the limitation of existing single-garment, posed-video pairs, and (3) assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers that stabilizes training, preserves pretrained generative quality, and improves garment accuracy, pose adherence, and identity preservation, while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.