🤖 AI Summary
Existing virtual try-on methods rely on multiple auxiliary inputs (e.g., pose maps, human-parsing maps), complex conditional networks, and lengthy diffusion sampling, resulting in high computational overhead and deployment challenges. This paper proposes MC-VTON, a lightweight, controllable diffusion Transformer (DiT) architecture that requires only two inputs, a masked person image and a garment image, eliminating reference networks, image encoders, and redundant conditioning and enabling end-to-end virtual try-on. The authors introduce minimal control, a control paradigm that directly adapts the DiT backbone to the try-on task. A conditional fusion mechanism and knowledge-distillation-based acceleration reduce inference to just 8 steps while increasing trainable parameters by only 0.33%. The method achieves state-of-the-art qualitative and quantitative results with significantly improved detail fidelity.
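The core design point above is that the only conditions are the masked person image and the garment image, both fed to the DiT through its own backbone rather than a reference network. A minimal, purely illustrative sketch of that input assembly (names, shapes, and the patchify routine are hypothetical, not the paper's code):

```python
# Hypothetical sketch of "minimal control" input assembly: the two condition
# images are patchified into token sequences and concatenated into a single
# DiT input sequence. Shapes and names here are illustrative assumptions.

def patchify(image, patch):
    """Split an H x W grid (nested lists) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[x][y]
                           for x in range(i, i + patch)
                           for y in range(j, j + patch)])
    return tokens

masked_person = [[0] * 8 for _ in range(8)]  # stand-in for the masked person latent
garment       = [[1] * 8 for _ in range(8)]  # stand-in for the garment latent

# One token sequence for the DiT backbone: no reference network, no image encoder.
seq = patchify(masked_person, 2) + patchify(garment, 2)
print(len(seq))  # 16 person tokens + 16 garment tokens = 32
```

In the real model the "tokens" would be VAE latents processed by the frozen DiT; the sketch only shows why no extra conditioning branch is needed once both images share one sequence.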
📝 Abstract
Virtual try-on methods based on diffusion models achieve realistic try-on effects, but they use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Moreover, they require more than 25 inference steps, leading to long inference times. In this work, motivated by the development of the diffusion transformer (DiT), we rethink the necessity of the reference network and image encoder, and propose MC-VTON, which enables a DiT to integrate minimal conditional try-on inputs using its intrinsic backbone. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove the extra reference network and image encoder, as well as unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; we require only the masked person image and the garment image. (3) Parameter-efficient training. To process the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply diffusion distillation to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
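The two parameter-efficiency claims can be sanity-checked against each other: the quoted percentages are consistent with FLUX.1-dev's roughly 12B-parameter backbone (the 12B figure is an assumption based on the model's public release; the adapter sizes and percentages come from the abstract):

```python
# Cross-check of the parameter-efficiency figures quoted in the abstract.
# ASSUMPTION: FLUX.1-dev backbone has ~12e9 parameters (from its public release).
BACKBONE_PARAMS = 12e9

def adapter_fraction(adapter_params: float) -> float:
    """Fraction of trainable adapter parameters relative to the frozen backbone."""
    return adapter_params / BACKBONE_PARAMS

print(f"try-on adapter:            {adapter_fraction(39.7e6):.2%}")  # ~0.33%
print(f"with distillation adapter: {adapter_fraction(86.8e6):.2%}")  # ~0.72%
```

Both ratios round to the percentages stated in the abstract, so the 39.7M and 86.8M counts are measured against the same frozen backbone.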