TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

📅 2024-11-26

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing virtual try-on (VTO) methods rely on outdated text-to-image (T2I) diffusion architectures, which limits garment texture fidelity, distortion-free rendering of text on garments, and reconstruction of material detail. To address these limitations, this work adapts Diffusion Transformer (DiT) based T2I models to VTO with three components: (1) a Garment Semantic (GS) Adapter for fine-grained alignment of garment-specific features; (2) a Text Preservation Loss that explicitly preserves the structural integrity of rendered text; and (3) an LLM-driven constraint mechanism for prompt generation that improves semantic controllability. The framework thus combines semantic modeling, multi-objective loss design, and LLM-guided prompt engineering on top of the DiT backbone. Experiments show state-of-the-art performance in both visual quality and text fidelity, outperforming prior methods across multiple benchmarks.


📝 Abstract

Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and in preserving fine-grained details, such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism for generating prompts by optimizing a Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for the VTO task. Project page: https://zhenchenwan.github.io/TED-VITON/

Problem

Research questions and friction points this paper is trying to address.

Outdated T2I models limit Virtual Try-On improvements.

Challenges in rendering garment text without distortion.

Need to preserve fine details like textures and materials.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Garment Semantic Adapter enhances garment features.

Text Preservation Loss ensures distortion-free text rendering.

LLM-optimized prompts improve text generation accuracy.
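
The summary above does not give the formula for the Text Preservation Loss, so the following is only a generic sketch of how a training objective can add extra weight to regions that contain text. It is not the paper's formulation. The function names `mse` and `combined_loss`, the `text_mask` input, and the weight `lam` are all hypothetical and chosen for illustration.

```python
def mse(pred, target, weights=None):
    """Weighted mean squared error over two flat lists of equal length."""
    if weights is None:
        weights = [1.0] * len(pred)
    total_weight = sum(weights)
    if total_weight == 0:
        # No weighted elements (e.g. an empty text mask): contribute nothing.
        return 0.0
    squared = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))
    return squared / total_weight


def combined_loss(pred, target, text_mask, lam=0.5):
    """Base reconstruction error plus an extra term restricted to text pixels.

    Assumption: text_mask holds 1.0 where the garment shows text, else 0.0.
    This is an illustrative stand-in, not the loss defined in the paper.
    """
    base = mse(pred, target)
    text_term = mse(pred, target, weights=text_mask)
    return base + lam * text_term


# Toy example: only the last element is wrong, and it lies in a text region.
pred = [0.0, 1.0, 2.0, 3.0]
target = [0.0, 1.0, 2.0, 5.0]
text_mask = [0.0, 0.0, 0.0, 1.0]
loss = combined_loss(pred, target, text_mask, lam=0.5)
```

In this toy case the error inside the masked region is counted twice, once in the base term and once in the masked term, which is the general idea behind penalizing distorted text more heavily than other regions.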
