Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

📅 2025-05-22
🏛️ International Conference on Learning Representations
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address garment deformation artifacts and texture-detail loss in diffusion-based virtual try-on (VTON), this paper proposes a diffusion generation framework guided by explicit visual correspondence. Fine-grained semantic points on the garment are matched to the target person via local flow warping, and these 2D matches are augmented with 3D-aware geometric cues (depth and surface-normal maps) to enforce realistic geometric constraints and local garment deformations during try-on. A dual-UNet architecture integrates the local flow-based warping, semantic point alignment, and 3D geometric priors, and a novel point-focused diffusion loss exploits the semantic point matches to supervise training. Evaluated on the VITON-HD and DressCode benchmarks, the method achieves state-of-the-art performance, significantly improving garment shape fidelity and texture-detail preservation. This work establishes a paradigm for controllable, geometry-aware garment synthesis in diffusion-based VTON.

📝 Abstract
Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation, respectively, and has emerged as the standard recipe for VTON. Nevertheless, it remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as an appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to those on the target person through local flow warping. These 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way clothing is put on the human body, and the 3D-aware cues act as semantic point matches to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of the semantic point matching. Extensive experiments demonstrate the strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.
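The abstract's core matching step, transporting garment semantic points to the target person via local flow warping, can be illustrated with a minimal sketch. This is not the paper's implementation; `warp_semantic_points` and the uniform toy flow field are hypothetical, and the real pipeline learns the flow and operates on feature-level correspondences:

```python
import numpy as np

def warp_semantic_points(points, flow):
    """Warp 2D semantic points from the garment frame into the target-person
    frame using a dense flow field (illustrative helper, not the paper's code).

    points: (N, 2) array of (x, y) pixel coordinates on the garment image.
    flow:   (H, W, 2) per-pixel (dx, dy) offsets mapping garment to person.
    """
    xs = points[:, 0].astype(int)
    ys = points[:, 1].astype(int)
    offsets = flow[ys, xs]   # look up the flow vector at each point location
    return points + offsets  # matched point locations on the target person

# Toy example: a uniform flow shifting every point 5 px right, 3 px down.
flow = np.zeros((64, 64, 2))
flow[..., 0] = 5.0
flow[..., 1] = 3.0
pts = np.array([[10.0, 20.0], [30.0, 40.0]])
warped = warp_semantic_points(pts, flow)
# matched points: (15, 23) and (35, 43)
```

In the paper these 2D matches are then lifted into 3D-aware cues by attaching the target person's depth and surface-normal values at the warped locations.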
Problem

Research questions and friction points this paper is trying to address.

Preserving garment shape and details in virtual try-on
Using visual correspondence to guide diffusion models
Enhancing 3D-aware cues for semantic point matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses visual correspondence as diffusion prior
Augments 2D points with 3D-aware cues
Introduces point-focused diffusion loss
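One plausible reading of a "point-focused" diffusion loss is a standard noise-prediction MSE reweighted so pixels near matched semantic points contribute more. The sketch below is an assumption of that form, not the paper's formulation; `point_mask` and `point_weight` are illustrative names:

```python
import numpy as np

def point_focused_diffusion_loss(pred_noise, true_noise, point_mask, point_weight=5.0):
    """Weighted noise-prediction loss (hedged sketch of a point-focused loss).

    pred_noise, true_noise: (B, C, H, W) arrays of predicted/ground-truth noise.
    point_mask: (B, 1, H, W) binary mask marking semantic-point neighborhoods.
    point_weight: multiplier applied to errors inside the mask.
    """
    per_pixel = (pred_noise - true_noise) ** 2
    # 1.0 everywhere, point_weight where the mask is set; broadcasts over C.
    weights = 1.0 + (point_weight - 1.0) * point_mask
    return float((weights * per_pixel).mean())

# With an empty mask the loss reduces to plain MSE.
pred = np.ones((1, 3, 4, 4))
true = np.zeros((1, 3, 4, 4))
loss_plain = point_focused_diffusion_loss(pred, true, np.zeros((1, 1, 4, 4)))
loss_focus = point_focused_diffusion_loss(pred, true, np.ones((1, 1, 4, 4)))
```

The design intent is simply to concentrate the training signal on the garment regions where correspondence is known, rather than treating all pixels uniformly.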