Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

📅 2026-04-06

🤖 AI Summary

This work targets the identity drift, garment distortion, and front-back inconsistency that arise when virtual try-on and pose-driven human animation are run as separate stages. The authors propose Vanast, a unified framework that generates garment-transferred human animation videos in a single step from one human image, one or more garment images, and a pose guidance video. To supervise this setting, they construct large-scale synthetic triplet data: identity-preserving human images in alternative outfits, triplets covering both upper and lower garments, and in-the-wild triplets that need no garment catalog images. A Dual Module architecture for video diffusion transformers stabilizes training, preserves pretrained generative quality, improves garment accuracy, pose adherence, and identity preservation, and supports zero-shot garment interpolation. According to the authors, the result is high-fidelity, identity-consistent animation across a wide range of garment types.

📝 Abstract

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
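To make the input/output contract described in the abstract concrete, the following is a minimal sketch of how one supervised training sample might be laid out. The abstract does not specify the exact fields of a triplet, so the class name `TryOnSample`, its field names, the array layouts, and the `is_consistent` check are all hypothetical and chosen only for illustration; they are not the authors' data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Shapes stand in for real image/video arrays so the sketch has no dependencies.
Shape = Tuple[int, ...]


@dataclass
class TryOnSample:
    """Hypothetical layout of one training sample (not the authors' format)."""
    human_image: Shape           # assumed (H, W, 3): single identity image
    garment_images: List[Shape]  # assumed one or more (H, W, 3) garment images
    pose_video: Shape            # assumed (T, H, W, 3): pose guidance frames
    target_video: Shape          # assumed (T, H, W, 3): supervision target


def is_consistent(s: TryOnSample) -> bool:
    """Check the assumed invariants: at least one garment, matching frame
    counts for pose and target, and matching spatial size for the human
    image and the target frames."""
    if not s.garment_images:
        return False
    if len(s.human_image) != 3 or len(s.pose_video) != 4 or len(s.target_video) != 4:
        return False
    same_length = s.pose_video[0] == s.target_video[0]
    same_size = s.human_image[:2] == s.target_video[1:3]
    return same_length and same_size


# Example: an upper and a lower garment, 16 frames at 512x384.
sample = TryOnSample(
    human_image=(512, 384, 3),
    garment_images=[(512, 384, 3), (512, 384, 3)],
    pose_video=(16, 512, 384, 3),
    target_video=(16, 512, 384, 3),
)
print(is_consistent(sample))
```

Allowing a list of garment images reflects the abstract's mention of full upper and lower garment triplets; whether the actual pipeline stores them this way is an assumption.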

Problem

Research questions and friction points this paper is trying to address.

virtual try-on

human animation

identity drift

garment distortion

pose-guided synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

virtual try-on

human animation

triplet supervision