MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video virtual try-on (VVT) methods suffer from severe spatiotemporal inconsistency and poor garment detail preservation: U-Net-based diffusion models exhibit limited representational capacity; decoupled spatial and temporal attention hinders cross-frame structural modeling; and insufficient fusion of garment texture, silhouette, and semantic information leads to dynamic try-on artifacts. This paper introduces the first end-to-end video diffusion Transformer framework, leveraging full self-attention to jointly model spatiotemporal dependencies. We propose a dual-granularity garment preservation strategy—coarse-grained via garment token embedding fusion and fine-grained via multi-condition guidance (semantic, texture, and silhouette). Additionally, we introduce a mask-aware reconstruction loss to enhance fidelity within garment regions. Our method achieves state-of-the-art performance on both image- and video-based try-on benchmarks, significantly improving structural stability, motion coherence, and perceptual realism, while demonstrating strong generalization to in-the-wild scenarios.
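The summary mentions a mask-aware reconstruction loss but does not spell it out. A minimal sketch of how such a term is commonly formed, with a global reconstruction term plus an up-weighted term restricted to the garment region (the names `pred`, `target`, `garment_mask`, and the weight `lambda_mask` are illustrative assumptions, not the paper's notation):

```python
import numpy as np

def mask_aware_loss(pred, target, garment_mask, lambda_mask=2.0):
    """Sketch of a mask-weighted reconstruction loss: a standard MSE over
    the whole frame plus an extra MSE restricted to the garment region,
    so errors inside the mask are penalized more heavily."""
    err = (pred - target) ** 2
    base = err.mean()                                                 # whole-frame loss
    masked = (garment_mask * err).sum() / (garment_mask.sum() + 1e-8)  # garment-region loss
    return base + lambda_mask * masked

# Toy example: one wrong pixel inside the garment mask contributes to
# both terms, so it is weighted more than an error outside the mask.
pred = np.zeros((4, 4))
target = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
pred[0, 0] = 1.0
loss = mask_aware_loss(pred, target, mask)
```

In practice such a term would be computed on latent or pixel frames inside the diffusion training loop; the sketch only shows the weighting idea.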

📝 Abstract
Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressive capability and struggle to reconstruct complex details. Second, they model spatial and temporal attention separately, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their expression of garment details remains insufficient, affecting the realism and stability of the overall synthesized results, especially during human motion. To address the above challenges, we propose MagicTryOn, a video virtual try-on framework built upon the large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and use full self-attention to jointly model the spatiotemporal consistency of videos. We design a coarse-to-fine garment preservation strategy. The coarse strategy integrates garment tokens during the embedding stage, while the fine strategy incorporates multiple garment-based conditions, such as semantics, textures, and contour lines, during the denoising stage. Moreover, we introduce a mask-aware loss to further optimize garment region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
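The abstract contrasts decoupled spatial/temporal attention with full self-attention over all video tokens. A minimal NumPy sketch of the joint variant, where all frames are flattened into a single token sequence so every patch attends to every patch in every frame (shapes, single-head attention, and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def full_spatiotemporal_attention(x):
    """x: (T, H, W, C) video features. Flattens all T*H*W patch tokens
    into one sequence and applies single-head scaled dot-product
    self-attention, so spatial and temporal dependencies are modeled
    jointly rather than by separate spatial and temporal passes."""
    T, H, W, C = x.shape
    tokens = x.reshape(T * H * W, C)               # joint spatiotemporal sequence
    scores = tokens @ tokens.T / np.sqrt(C)        # (N, N) affinities across all frames
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ tokens                         # each token mixes info from every frame
    return out.reshape(T, H, W, C)

video = np.random.default_rng(0).normal(size=(2, 4, 4, 8))
attended = full_spatiotemporal_attention(video)
```

A decoupled design would instead run attention over the H*W tokens of each frame and then over the T tokens at each spatial position; the joint sequence above is what lets cross-frame structure be modeled in one attention operation.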
Problem

Research questions and friction points this paper is trying to address.

Improve spatiotemporal consistency in video virtual try-on
Enhance garment detail preservation during human motion
Replace U-Net with diffusion Transformer for better detail reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion Transformer for video modeling
Combines full self-attention for spatiotemporal consistency
Implements coarse-to-fine garment preservation strategy
Guangyuan Li
Zhejiang University
Low-Level Vision, Medical Image Analysis, Video Generation

Siming Zheng
UCAS, vivo
AIGC, Low-Level Vision, Computational Photography, Snapshot Compressive Imaging, Deep Learning

Hao Zhang
vivo Mobile Communication Co., Ltd

Jinwei Chen
vivo
Computer Vision

Junsheng Luan
Zhejiang University

Binkai Ou
Innovation Research & Development, BoardWare Information System Limited

Lei Zhao
College of Computer Science and Technology, Zhejiang University

Bo Li
vivo Mobile Communication Co., Ltd

Peng-Tao Jiang
Researcher, vivo
Diffusion Models, Dense Predictions, Visual Attention