🤖 AI Summary
Virtual try-on faces fundamental challenges, including poor garment-human alignment, heavy reliance on segmentation masks or pose annotations, and scarce high-quality training data. This paper introduces Any2AnyTryon, presented as the first zero-prior, end-to-end arbitrary-image-to-arbitrary-image virtual try-on framework: it requires no human parsing, keypoint detection, or garment masks, and needs only a source garment image and a text instruction to synthesize a target person wearing the garment. Key contributions include (1) an adaptive position embedding mechanism for precise spatial correspondence; (2) LAION-Garment, a large-scale open-source garment try-on dataset that the authors describe as the largest known; and (3) a diffusion training paradigm that integrates synthetic data augmentation with multi-condition alignment. The method generalizes across diverse garment sizes, categories, and multimodal inputs, significantly outperforming state-of-the-art methods (e.g., MMTryon) on multiple benchmarks and achieving high-fidelity, highly controllable, mask-free, and real-time virtual try-on generation.
📝 Abstract
Image-based virtual try-on (VTON) aims to generate a try-on result by transferring an input garment onto a target person's image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. It also limits their ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. Even so, existing methods are typically limited to specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of Any2AnyTryon and compare it with existing methods. The results show that Any2AnyTryon enables flexible, controllable, and high-quality image-based virtual try-on generation.
Project page: https://logn-2024.github.io/Any2anyTryonProjectPage/