DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on (VTO) methods suffer from limitations in fine-detail preservation, robustness to real-world imagery, sampling efficiency, editing flexibility, and cross-category generalization. To address these challenges, we propose DiT-VTON, the first unified VTO framework built on a Diffusion Transformer, enabling high-fidelity try-on of diverse clothing items and accessories across multiple categories. We introduce the "Virtual Try-All" paradigm, which preserves pose, supports localized editing, and enables texture transfer without requiring auxiliary condition encoders. Structural and textural conditioning is explored via in-context token concatenation, channel-wise feature concatenation, and ControlNet-based fusion. DiT-VTON achieves significant improvements over state-of-the-art methods on VITON-HD and outperforms editable baselines on a large-scale multimodal dataset spanning thousands of product categories, demonstrating strong robustness, superior detail fidelity, and broad generalization.

📝 Abstract
The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies, enabling customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework that leverages a Diffusion Transformer (DiT), renowned for its performance on text-conditioned image generation, adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities such as pose preservation, localized editing, texture transfer, and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on VITON-HD, achieving superior detail preservation and robustness without reliance on additional condition encoders. It also outperforms models with VTA and image editing capabilities on a diverse dataset spanning thousands of product categories.
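The two token-level conditioning configurations the abstract compares (in-context token concatenation vs. channel concatenation) can be sketched in a few lines. This is a minimal illustration of the tensor-shape mechanics only, not the paper's implementation; all shapes and the projection matrix are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 256 patch tokens of width 64 for the person image,
# and 256 for the garment/product reference image.
person_tokens = rng.standard_normal((256, 64))
reference_tokens = rng.standard_normal((256, 64))

# 1) In-context (sequence) concatenation: reference tokens are appended to
#    the token sequence, so self-attention can attend across both images.
#    Sequence length doubles; token width is unchanged.
seq_tokens = np.concatenate([person_tokens, reference_tokens], axis=0)
print(seq_tokens.shape)  # (512, 64)

# 2) Channel concatenation: tokens are paired position-wise and fused along
#    the feature axis, then linearly projected back to the model width.
#    Sequence length is unchanged; an extra projection is needed.
chan_tokens = np.concatenate([person_tokens, reference_tokens], axis=1)  # (256, 128)
proj = rng.standard_normal((128, 64)) / np.sqrt(128)  # hypothetical fusion projection
chan_tokens = chan_tokens @ proj
print(chan_tokens.shape)  # (256, 64)
```

The trade-off the sketch makes visible: sequence concatenation grows attention cost quadratically with the added tokens but needs no new parameters, while channel concatenation keeps sequence length fixed at the cost of a learned fusion projection. (ControlNet-based fusion, the third configuration, instead injects reference features through a separate conditioning branch and is not shown here.)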
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained detail preservation in virtual try-on
Improving robustness across diverse real-world product categories
Integrating advanced image editing capabilities into virtual try-all
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Diffusion Transformer for image-conditioned virtual try-on
Integrates ControlNet and multiple configurations for enhanced conditioning
Expands dataset and capabilities for versatile Virtual Try-All solution