OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework

📅 2026-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes OmniDiT, a unified framework addressing key limitations in virtual try-on (VTON) and virtual try-off (VTOFF) tasks: insufficient detail preservation, weak generalization, and low inference efficiency. Built on a diffusion Transformer architecture, OmniDiT introduces Shifted Window Attention into diffusion models for the first time to reduce computational overhead, and incorporates a self-evolving data pipeline to generate high-quality multimodal training data. The framework enables end-to-end joint modeling of VTON and VTOFF through conditional fusion, token concatenation, and adaptive positional encoding. It achieves the best performance in model-free VTON and VTOFF scenarios and results comparable to the state of the art in the model-based setting, while improving generation quality in complex scenes and inference efficiency.
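The summary's "token concatenation and adaptive positional encoding" for multiple reference conditions can be sketched roughly as follows. This is a minimal illustration, not the paper's actual design: the function name, the per-stream position offset, and the flat `(tokens, channels)` layout are all assumptions made for the sketch.

```python
import numpy as np

def concat_with_adaptive_positions(latent_tokens, reference_tokens, offset=10_000):
    """Concatenate noisy latent tokens with reference-condition token streams.

    latent_tokens:    (N, C) array of denoising-target tokens.
    reference_tokens: list of (M_i, C) arrays (e.g. garment / model references).
    Hypothetical scheme: stream i gets position ids shifted by i * offset, so no
    reference stream ever shares a position id with the latent tokens.
    """
    tokens = [latent_tokens]
    positions = [np.arange(latent_tokens.shape[0])]
    for i, ref in enumerate(reference_tokens, start=1):
        tokens.append(ref)
        positions.append(np.arange(ref.shape[0]) + i * offset)
    return np.concatenate(tokens, axis=0), np.concatenate(positions, axis=0)
```

The combined sequence can then be fed to a single attention stack, with the disjoint position ranges telling the model which tokens belong to which condition.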

📝 Abstract
Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods struggle with fine-grained detail preservation, generalization to complex scenes, complicated pipelines, and efficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines the try-on and try-off tasks in one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset, Omni-TryOn, which contains over 380k diverse, high-quality garment-model-tryon image pairs with detailed text prompts. Then, we employ token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long-sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, achieving linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple-timestep prediction and an alignment loss to improve generation fidelity. Experiments show that, across various complex scenes, our method achieves the best performance on both the model-free VTON and VTOFF tasks and performance comparable to current SOTA methods on the model-based VTON task.
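The abstract's linear-complexity claim follows the Swin-style recipe: attention is computed only inside fixed-size windows, and a cyclic shift between layers lets neighboring windows exchange information. The sketch below is an assumption-laden toy version on a 1D token sequence (plain NumPy, no masking of wrapped-around tokens), not the paper's implementation:

```python
import numpy as np

def window_partition(x, window):
    # x: (L, C) token sequence; split into non-overlapping windows of length `window`.
    L, C = x.shape
    assert L % window == 0, "sequence length must be divisible by the window size"
    return x.reshape(L // window, window, C)

def shifted_window_self_attention(x, window, shift):
    """Toy shifted-window self-attention over a (L, C) token sequence."""
    x = np.roll(x, -shift, axis=0)          # cyclic shift so info crosses window borders
    windows = window_partition(x, window)
    out = np.empty_like(windows)
    for i, w in enumerate(windows):
        # attention is restricted to the `window` tokens of this window
        scores = w @ w.T / np.sqrt(w.shape[-1])
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[i] = probs @ w
    y = out.reshape(-1, x.shape[-1])
    return np.roll(y, shift, axis=0)        # undo the shift
```

Each window costs O(window²) attention, and there are L/window windows, so the total is O(L · window) instead of the O(L²) of full attention; with a fixed window size this is linear in sequence length.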
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On
fine-grained detail preservation
generalization to complex scenes
efficient inference
Try-Off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
OmniDiT
Shifted Window Attention
Self-evolving Data Curation
Virtual Try-On
Authors

Weixuan Zeng
The Chinese University of Hong Kong, Shenzhen
Pengcheng Wei
Beihang University
Huaiqing Wang
KuaiShou
Boheng Zhang
KuaiShou
Jia Sun
Hong Kong University of Science and Technology (Guangzhou)
Dewen Fan
KuaiShou
Lin He
KuaiShou
Long Chen
KuaiShou
Qianqian Gan
KuaiShou
Fan Yang
KuaiShou
Tingting Gao
KuaiShou