MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on methods typically process upper and lower garments separately, rely on cumbersome preprocessing pipelines, and struggle to preserve individual characteristics such as tattoos, accessories, and body shape, leading to limited visual realism and personalization. To address these limitations, we propose MuGa-VTON, a unified diffusion-based Transformer framework for virtual try-on. MuGa-VTON jointly models multi-garment synthesis and identity-preserving features within a shared latent space through three synergistic components: garment representation, person representation, and textual prompt encoding. Crucially, it deeply integrates garment semantics, person identity, and fine-grained textual guidance to enable precise, controllable editing. Evaluated on VITON-HD and DressCode, MuGa-VTON achieves state-of-the-art performance in both visual quality and identity fidelity, significantly advancing high-fidelity, personalized virtual try-on.

📝 Abstract
Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape, resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we propose three key modules: the Garment Representation Module (GRM), which captures the semantics of both upper and lower garments; the Person Representation Module (PRM), which encodes identity and pose cues; and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.
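The abstract describes the A-DiT module as fusing garment, person, and text-prompt tokens in a shared latent space via a diffusion transformer. The paper does not publish its implementation, so the snippet below is only an illustrative sketch of the core idea: concatenating the three token streams and letting a single self-attention step mix them, with random matrices standing in for learned Q/K/V projections. All shapes and names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(garment, person, text, rng):
    """Toy stand-in for A-DiT fusion: joint self-attention over the
    concatenated garment, person, and text token sequences."""
    d = garment.shape[-1]
    tokens = np.concatenate([garment, person, text], axis=0)  # (N, d)
    # Random projections stand in for learned Q/K/V weight matrices.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v  # every token attends across all three modalities

rng = np.random.default_rng(0)
d = 16
garment = rng.standard_normal((8, d))   # upper + lower garment tokens
person  = rng.standard_normal((6, d))   # identity / pose tokens
text    = rng.standard_normal((4, d))   # prompt tokens
fused = fuse_tokens(garment, person, text, rng)
print(fused.shape)  # (18, 16)
```

In the actual framework this mixing would happen inside every transformer block of the denoiser, conditioned on the diffusion timestep; the sketch only shows why a shared token sequence lets garment edits, identity cues, and prompt guidance influence one another.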
Problem

Research questions and friction points this paper is trying to address.

Handling upper and lower garments simultaneously in virtual try-on
Preserving personal identity cues such as tattoos and body shape
Reducing preprocessing requirements while improving realism and flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-garment diffusion framework
Garment and person representation modules
Prompt-based customization via diffusion transformer