🤖 AI Summary
Existing virtual try-on methods rely on multi-view images or physics-based priors, making it difficult to achieve dynamic interaction and full 4D try-on (free pose control, novel-view synthesis, and diverse garment customization) under single-view supervision. This paper introduces the first end-to-end, single-view-driven 4D virtual try-on framework. The authors propose a physics-free nonlinear Gaussian deformation network, coupled with a novel bidirectional optical-flow correction strategy, to model temporally consistent and adaptive garment dynamics. Given only a single reference garment image and a target human pose sequence, the method generates photorealistic and temporally coherent dynamic try-on results. Extensive evaluations on multiple benchmarks demonstrate significant improvements over state-of-the-art approaches. The framework enables practical applications in AR/VR, digital avatars, and gaming, advancing 4D virtual try-on toward lightweight, general-purpose deployment.
📝 Abstract
We propose AvatarVTON, the first 4D virtual try-on framework that generates realistic try-on results from a single in-shop garment image, enabling free pose control, novel-view rendering, and diverse garment choices. Unlike existing methods, AvatarVTON supports dynamic garment interactions under single-view supervision, without relying on multi-view garment captures or physics priors. The framework consists of two key modules: (1) a Reciprocal Flow Rectifier, a prior-free optical-flow correction strategy that stabilizes avatar fitting and ensures temporal coherence; and (2) a Non-Linear Deformer, which decomposes Gaussian maps into view-pose-invariant and view-pose-specific components, enabling adaptive, non-linear garment deformations. To establish a benchmark for 4D virtual try-on, we extend existing baselines with unified modules for fair qualitative and quantitative comparisons. Extensive experiments show that AvatarVTON achieves high fidelity, diversity, and dynamic garment realism, making it well-suited for AR/VR, gaming, and digital-human applications.
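The abstract does not detail how the Reciprocal Flow Rectifier corrects optical flow, but bidirectional (forward-backward) flow consistency is the standard prior-free way to detect unreliable flow: a pixel's forward flow and the backward flow sampled at its target location should cancel out. A minimal illustrative sketch of that generic check (not the paper's actual module; `warp_flow`, `fb_consistency_mask`, and the nearest-neighbor sampling are assumptions made for brevity):

```python
import numpy as np

def warp_flow(flow, disp):
    """Sample `flow` at positions displaced by `disp`.

    Uses nearest-neighbor sampling with border clamping for brevity;
    real pipelines would use bilinear interpolation.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xq = np.clip(np.rint(xs + disp[..., 0]).astype(int), 0, w - 1)
    yq = np.clip(np.rint(ys + disp[..., 1]).astype(int), 0, h - 1)
    return flow[yq, xq]

def fb_consistency_mask(fwd, bwd, tol=1.0):
    """Mark pixels where forward flow and back-warped backward flow cancel.

    fwd, bwd: (H, W, 2) flow fields in (dx, dy) order.
    Returns a boolean (H, W) mask of flow vectors deemed reliable.
    """
    bwd_at_target = warp_flow(bwd, fwd)          # backward flow at each pixel's destination
    err = np.linalg.norm(fwd + bwd_at_target, axis=-1)
    return err < tol
```

Inconsistent pixels flagged by such a mask would typically be discarded or re-estimated before they are used to supervise temporal coherence.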