OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework

📅 2026-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes OmniDiT, a unified framework addressing key limitations in virtual try-on (VTON) and virtual try-off (VTOFF) tasks: insufficient detail preservation, weak generalization, and low inference efficiency. Built on a diffusion Transformer architecture, OmniDiT introduces Shifted Window Attention into diffusion models for the first time to reduce computational overhead, and incorporates a self-evolving data pipeline to generate high-quality multimodal training data. The framework enables end-to-end joint modeling of VTON and VTOFF through conditional fusion, token concatenation, and adaptive positional encoding. It achieves the best performance in model-free VTON and VTOFF scenarios and results comparable to the state of the art in the model-based setting, while improving generation quality in complex scenes and inference efficiency.
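The summary's "token concatenation and adaptive positional encoding" for multiple reference conditions can be sketched roughly as follows. This is a minimal illustration, not the paper's actual design: the function name, the per-stream position offset, and the flat `(tokens, channels)` layout are all assumptions made for the sketch.

```python
import numpy as np

def concat_with_adaptive_positions(latent_tokens, reference_tokens, offset=10_000):
    """Concatenate noisy latent tokens with reference-condition token streams.

    latent_tokens:    (N, C) array of denoising-target tokens.
    reference_tokens: list of (M_i, C) arrays (e.g. garment / model references).
    Hypothetical scheme: stream i gets position ids shifted by i * offset, so no
    reference stream ever shares a position id with the latent tokens.
    """
    tokens = [latent_tokens]
    positions = [np.arange(latent_tokens.shape[0])]
    for i, ref in enumerate(reference_tokens, start=1):
        tokens.append(ref)
        positions.append(np.arange(ref.shape[0]) + i * offset)
    return np.concatenate(tokens, axis=0), np.concatenate(positions, axis=0)
```

The combined sequence can then be fed to a single attention stack, with the disjoint position ranges telling the model which tokens belong to which condition.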

📝 Abstract
Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods struggle with fine-grained detail preservation, generalization to complex scenes, complicated pipelines, and efficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines the try-on and try-off tasks in one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset, Omni-TryOn, which contains over 380k diverse, high-quality garment-model-tryon image pairs with detailed text prompts. Then, we employ token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long-sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, achieving linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple-timestep prediction and an alignment loss to improve generation fidelity. Experiments show that, across various complex scenes, our method achieves the best performance on both the model-free VTON and VTOFF tasks and performance comparable to current SOTA methods on the model-based VTON task.
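The abstract's linear-complexity claim follows the Swin-style recipe: attention is computed only inside fixed-size windows, and a cyclic shift between layers lets neighboring windows exchange information. The sketch below is an assumption-laden toy version on a 1D token sequence (plain NumPy, no masking of wrapped-around tokens), not the paper's implementation:

```python
import numpy as np

def window_partition(x, window):
    # x: (L, C) token sequence; split into non-overlapping windows of length `window`.
    L, C = x.shape
    assert L % window == 0, "sequence length must be divisible by the window size"
    return x.reshape(L // window, window, C)

def shifted_window_self_attention(x, window, shift):
    """Toy shifted-window self-attention over a (L, C) token sequence."""
    x = np.roll(x, -shift, axis=0)          # cyclic shift so info crosses window borders
    windows = window_partition(x, window)
    out = np.empty_like(windows)
    for i, w in enumerate(windows):
        # attention is restricted to the `window` tokens of this window
        scores = w @ w.T / np.sqrt(w.shape[-1])
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[i] = probs @ w
    y = out.reshape(-1, x.shape[-1])
    return np.roll(y, shift, axis=0)        # undo the shift
```

Each window costs O(window²) attention, and there are L/window windows, so the total is O(L · window) instead of the O(L²) of full attention; with a fixed window size this is linear in sequence length.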
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On
fine-grained detail preservation
generalization to complex scenes
efficient inference
Try-Off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
OmniDiT
Shifted Window Attention
Self-evolving Data Curation
Virtual Try-On
Authors

Weixuan Zeng
The Chinese University of Hong Kong, Shenzhen
Pengcheng Wei
Beihang University
Huaiqing Wang
KuaiShou
Boheng Zhang
KuaiShou
Jia Sun
Hong Kong University of Science and Technology (Guangzhou)
Dewen Fan
KuaiShou
Lin He
KuaiShou
Long Chen
KuaiShou
Qianqian Gan
KuaiShou
Fan Yang
KuaiShou
Tingting Gao
KuaiShou