EVTAR: End-to-End Try on with Additional Unpaired Visual Reference

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on methods rely on complex annotations—such as human pose, keypoints, or segmentation masks—limiting their practical applicability. This paper proposes an end-to-end virtual try-on framework requiring only a source person image and a target garment image. To enhance texture consistency and fine-detail preservation, we introduce unpaired reference images—i.e., the same garment worn by different individuals. Our method features: (1) a two-stage training strategy; (2) a visual reference feature injection mechanism that emulates human-like reference-based styling decisions; and (3) an adversarial learning framework requiring no dense annotations or additional supervision. Evaluated on two mainstream benchmarks, our approach achieves state-of-the-art performance in garment structural accuracy, texture fidelity, and visual naturalness. Notably, it is the first to enable high-quality, cross-person texture-consistent virtual try-on using only lightweight, annotation-free inputs.

📝 Abstract
We propose EVTAR, an End-to-End Virtual Try-on model with Additional Reference, which directly fits the target garment onto the person image while incorporating reference images to enhance try-on accuracy. Most existing virtual try-on approaches rely on complex inputs such as agnostic person images, human pose, densepose, or body keypoints, making them labor-intensive and impractical for real-world applications. In contrast, EVTAR adopts a two-stage training strategy, enabling simple inference with only the source person image and the target garment as inputs. Our model generates try-on results without masks, densepose, or segmentation maps. Moreover, EVTAR leverages additional reference images of different individuals wearing the same clothes to better preserve garment texture and fine-grained details. This mechanism is analogous to how humans consult reference models when choosing outfits, thereby simulating a more realistic and higher-quality dressing effect. We enrich the training data with supplementary references and unpaired person images to support these capabilities. We evaluate EVTAR on two widely used benchmarks and diverse tasks, and the results consistently validate the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Existing methods depend on complex inputs (agnostic person images, pose, densepose, keypoints)
Dense annotations make pipelines labor-intensive and impractical for real-world use
Garment texture and fine-grained details are hard to preserve from a single garment image
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end virtual try-on with reference images
Two-stage training for simple inference inputs
Leverages unpaired reference images for garment details
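The listing does not include architectural details of the "visual reference feature injection" mechanism. Purely as a hedged illustration, one common way to inject features from an unpaired reference image is cross-attention, where person-image tokens (queries) attend over reference-image tokens (keys/values); all names, shapes, and the residual connection below are hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_reference_features(person_tokens, reference_tokens):
    """Sketch of reference feature injection via cross-attention.

    person_tokens:    (N, d) features of the source person image.
    reference_tokens: (M, d) features of an unpaired reference image
                      (a different individual wearing the same garment).

    Person tokens query the reference tokens, so garment texture cues
    from the reference can flow into the person representation.
    """
    d = person_tokens.shape[-1]
    scores = person_tokens @ reference_tokens.T / np.sqrt(d)  # (N, M)
    attn = softmax(scores, axis=-1)                           # rows sum to 1
    injected = attn @ reference_tokens                        # (N, d)
    # Residual add, as is typical for feature-injection layers.
    return person_tokens + injected
```

In a real model the queries, keys, and values would be learned projections inside a generator's attention blocks; this sketch only conveys the data flow of conditioning generation on an unpaired reference.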
Authors

Liuzhuozheng Li — The University of Tokyo
Yue Gong — 360 AI Research
Shanyuan Liu — 360 AI Research
Bo Cheng — 360 AI Research
Yuhang Ma — Bytedance, University College London (Generative AI · Multi-module Pretraining · (Conditional) Text-to-image Generation (AIGC))
Liebucha Wu — 360 AI Research
Dengyang Jiang — Northwestern Polytechnical University (Computer Vision · Deep Learning · Machine Learning)
Zanyi Wang — University of California San Diego
Dawei Leng — (Multimodal Understanding · Multimodal Generation · Vision and Language)
Yuhui Yin — 360 AI Research