MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on methods typically process upper and lower garments separately, rely on cumbersome preprocessing pipelines, and struggle to preserve individual characteristics such as tattoos, accessories, and body shape, leading to limited visual realism and personalization. To address these limitations, we propose MuGa-VTON, a unified diffusion-based Transformer framework for virtual try-on. MuGa-VTON jointly models multi-garment synthesis and identity-preserving features within a shared latent space through three synergistic components: garment representation, person representation, and textual prompt encoding. Crucially, it deeply integrates garment semantics, person identity, and fine-grained textual guidance to enable precise, controllable editing. Evaluated on VITON-HD and DressCode, MuGa-VTON achieves state-of-the-art performance in both visual quality and identity fidelity, significantly advancing high-fidelity, personalized virtual try-on.

📝 Abstract
Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape, resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we propose three key modules: the Garment Representation Module (GRM), which captures the semantics of both upper and lower garments; the Person Representation Module (PRM), which encodes identity and pose cues; and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.
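The abstract describes the A-DiT module as fusing garment, person, and text-prompt tokens in a shared latent space via a diffusion transformer. The paper does not publish its implementation, so the snippet below is only an illustrative sketch of the core idea: concatenating the three token streams and letting a single self-attention step mix them, with random matrices standing in for learned Q/K/V projections. All shapes and names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(garment, person, text, rng):
    """Toy stand-in for A-DiT fusion: joint self-attention over the
    concatenated garment, person, and text token sequences."""
    d = garment.shape[-1]
    tokens = np.concatenate([garment, person, text], axis=0)  # (N, d)
    # Random projections stand in for learned Q/K/V weight matrices.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v  # every token attends across all three modalities

rng = np.random.default_rng(0)
d = 16
garment = rng.standard_normal((8, d))   # upper + lower garment tokens
person  = rng.standard_normal((6, d))   # identity / pose tokens
text    = rng.standard_normal((4, d))   # prompt tokens
fused = fuse_tokens(garment, person, text, rng)
print(fused.shape)  # (18, 16)
```

In the actual framework this mixing would happen inside every transformer block of the denoiser, conditioned on the diffusion timestep; the sketch only shows why a shared token sequence lets garment edits, identity cues, and prompt guidance influence one another.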
Problem

Research questions and friction points this paper is trying to address.

Handling upper and lower garments simultaneously in virtual try-on
Preserving personal identity cues such as tattoos and body shape
Reducing preprocessing requirements while improving realism and flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-garment diffusion framework
Garment and person representation modules
Prompt-based customization via diffusion transformer