DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on methods struggle to simultaneously achieve precise garment-human geometric alignment and high-fidelity texture reconstruction. To address this, the paper proposes a disentangled dual-scale generative framework: it first models garment-body semantic correspondence at low resolution, then reconstructs fine-grained textures and structural details at high resolution via a residual-guided diffusion process. The approach is fully mask-free and end-to-end, eliminating reliance on human parsing maps and instead leveraging the semantic priors embedded in pretrained diffusion models to preserve appearance consistency and pose robustness. Evaluated on multiple standard benchmarks, the method achieves state-of-the-art structural alignment and texture fidelity, with clear gains in image naturalness, detail sharpness, and pose consistency.

📝 Abstract
Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Accurately aligning garment with target human body
Preserving fine-grained garment textures and patterns
Eliminating reliance on human parsing or segmentation masks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-scale framework for alignment and texture
Residual-guided diffusion for high-resolution refinement
Mask-free generation using semantic priors
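
The dual-scale pipeline described above can be sketched end to end. This is a minimal, hypothetical illustration, not the paper's implementation: `low_res_stage` and `high_res_stage` stand in for the two diffusion models (here a naive blend and a zero predicted residual), and `bilinear_resize` is a nearest-neighbor placeholder for real resampling. Only the control flow, coarse generation at reduced resolution followed by residual-guided refinement at full resolution, mirrors the described method.

```python
import numpy as np

def bilinear_resize(img, h, w):
    # Nearest-neighbor resize stand-in; a real system would use proper
    # bilinear interpolation or learned up/downsampling.
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[ys][:, xs]

def low_res_stage(person, garment, scale=4):
    # Stage 1: coarse try-on at reduced resolution, where limited detail
    # makes structural alignment easier. The blend below is a stub for
    # the first diffusion model.
    h, w = person.shape[0] // scale, person.shape[1] // scale
    p = bilinear_resize(person, h, w)
    g = bilinear_resize(garment, h, w)
    return 0.5 * p + 0.5 * g

def high_res_stage(person, coarse):
    # Stage 2: residual-guided refinement. The coarse result is upsampled
    # to full resolution and corrected by a residual that restores texture
    # detail; here the "model" predicts a zero residual as a placeholder.
    up = bilinear_resize(coarse, person.shape[0], person.shape[1])
    residual = np.zeros_like(up)  # placeholder for the diffusion-predicted residual
    return up + residual

# Toy grayscale inputs standing in for person and garment images.
person = np.random.rand(64, 48)
garment = np.random.rand(64, 48)
coarse = low_res_stage(person, garment)      # (16, 12) coarse try-on
result = high_res_stage(person, coarse)      # (64, 48) refined output
```

The key design point this sketch captures is that the second stage operates on the *residual* between scales rather than regenerating the image from scratch, so structure settled at low resolution is kept fixed while only texture detail is refined.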
👥 Authors
Xianbing Sun (Shanghai Jiao Tong University)
Yan Hong (Ant Group)
Jiahui Zhan (Shanghai Jiao Tong University)
Jun Lan (Ant Group)
Huijia Zhu (Ant Group)
Weiqiang Wang (Ant Group)
Liqing Zhang (Professor of Computer Science, Virginia Tech; bioinformatics, data analytics, machine learning)
Jianfu Zhang (Shanghai Jiao Tong University; machine learning, computer vision)