ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On

📅 2025-03-26
🤖 AI Summary
This paper addresses a central challenge in image-based virtual try-on (IVTON): modeling global garment context while also generating fine-grained garment details. To this end, the authors propose a lightweight Masked Diffusion Transformer architecture that replaces the computationally intensive U-Net backbone. The key contributions are: (1) an Image-Timestep Adaptive Feature Aggregator (ITAFA) that dynamically fuses the features produced by the image encoder, weighted by diffusion timestep and garment image complexity; (2) a Salient Region Extractor (SRE) that supplies high-resolution conditional guidance for garment-critical regions only, improving detail fidelity without processing the entire garment image at high resolution; and (3) a latent-space masking strategy integrated with image encoder feature fusion. Experiments report state-of-the-art results on several metrics across multiple benchmarks (including VITON-HD, DressCode, and AT-VTON) while reducing computational overhead. The method preserves visual realism and consistent human-garment structural alignment, giving a favorable trade-off between efficiency and quality.

📝 Abstract
This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT uses a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by the diffusion timestep and the complexity of the garment image. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module identifies complex regions of the garment and provides high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy improves the preservation of fine details in highly salient garment regions and saves computation by avoiding unnecessary processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.
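To make the adaptive weighting idea concrete, the following is a minimal sketch, not the authors' implementation. It fuses several equal-size feature vectors into one vector of the same size using softmax weights that depend on a normalized timestep and a garment complexity score. The function names, the hand-written scoring rule, the `sharpness` parameter, and the assumption that shallow encoder layers carry fine detail while deep layers carry global context are all hypothetical; in the paper the weighting is part of the trained model, not a fixed rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_features(layer_features, timestep, complexity, sharpness=4.0):
    """Fuse per-layer feature vectors into one vector of the same size.

    layer_features: list of equal-length vectors, ordered shallow to deep.
    timestep:       normalized to [0, 1], where 1 is the noisiest step.
    complexity:     garment complexity score in [0, 1] (hypothetical input).
    """
    n = len(layer_features)
    # Hypothetical rule: noisy steps on simple garments favor deep (global)
    # layers; late steps or complex garments favor shallow (detail) layers.
    preferred_depth = timestep * (1.0 - complexity)
    depths = [i / (n - 1) if n > 1 else 0.0 for i in range(n)]
    logits = [-sharpness * abs(d - preferred_depth) for d in depths]
    weights = softmax(logits)

    dim = len(layer_features[0])
    fused = [
        sum(w * feats[k] for w, feats in zip(weights, layer_features))
        for k in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-layer, 2-dim input
    fused, weights = aggregate_features(features, timestep=1.0, complexity=0.0)
    print(weights, fused)
```

The fused output keeps the dimensionality of a single layer's features, which matches the abstract's description of a "unified feature of the same size"; only the mixture weights change across denoising steps.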
Problem

Research questions and friction points this paper is trying to address.

Improves virtual try-on by handling both global garment context and fine-grained garment details
Reduces computational overhead with a lightweight transformer-based diffusion model
Enhances detail preservation in salient garment regions via adaptive conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

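Uses a Masked Diffusion Transformer as the denoising model in place of a large pre-trained U-Net
Introduces the Image-Timestep Adaptive Feature Aggregator to weight image encoder features by diffusion timestep and garment image complexity
Incorporates the Salient Region Extractor to supply high-resolution local conditioning for complex garment regions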
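The salient-region idea can be illustrated with a small sketch, under stated assumptions. It scores each pixel of a grayscale garment image by a simple gradient magnitude, takes the bounding box of pixels above a threshold, and crops that box so it could be passed to a model as a separate high-resolution condition. Gradient magnitude is a stand-in saliency measure chosen for illustration; the summary above does not specify how the paper's SRE scores regions, and `salient_bbox`, `crop`, and `threshold` are hypothetical names.

```python
def salient_bbox(image, threshold=0.2):
    """Return (top, left, bottom, right) around high-gradient pixels.

    image: 2D list of floats (grayscale). Bounds are half-open, so the box
    can be used directly for slicing. Falls back to the full image when no
    pixel exceeds the threshold.
    """
    h, w = len(image), len(image[0])
    rows, cols = [], []
    for y in range(h - 1):
        for x in range(w - 1):
            gx = abs(image[y][x + 1] - image[y][x])  # horizontal difference
            gy = abs(image[y + 1][x] - image[y][x])  # vertical difference
            if gx + gy > threshold:
                rows.append(y)
                cols.append(x)
    if not rows:
        return (0, 0, h, w)
    # +2 so the box also covers the neighbor pixel used in each difference.
    return (min(rows), min(cols), max(rows) + 2, max(cols) + 2)


def crop(image, bbox):
    """Crop a 2D list image to a half-open (top, left, bottom, right) box."""
    top, left, bottom, right = bbox
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    # Toy garment: flat background with one small detailed patch.
    img = [[0.0] * 6 for _ in range(6)]
    for y in (2, 3):
        for x in (2, 3):
            img[y][x] = 1.0
    box = salient_bbox(img)
    print(box, crop(img, box))
```

Only the cropped region would need high-resolution encoding in this sketch, which is the efficiency argument the summary makes for conditioning on salient regions rather than on the whole garment image.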
Authors

Jiajing Hong
Korea Advanced Institute of Science and Technology (KAIST), South Korea
Tri Ton
KAIST, Korea
Computer Vision
Trung X. Pham
Korea Advanced Institute of Science and Technology (KAIST), South Korea
Gwanhyeong Koo
KAIST
Sunjae Yoon
KAIST
Deep Learning, Computer Vision, Generative AI
C. D. Yoo
Korea Advanced Institute of Science and Technology (KAIST), South Korea