HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Virtual try-on faces three core challenges across poses: geometric distortion, semantic inconsistency, and loss of fine detail. To address these, we propose HF-VTON, a framework with three cooperating modules: APWAM (Appearance-Preserving Warp Alignment Module), which aligns garments to the target pose; SRCM (Semantic Representation and Comprehension Module), which learns fine-grained garment semantics from garment attributes and multi-pose data; and MPAGM (Multimodal Prior-Guided Appearance Generation Module), which combines multimodal features with prior knowledge from pre-trained models during generation. We further introduce SAMP-VTONS, a dataset of multi-pose pairs with rich textual annotations, built for more comprehensive evaluation. Experiments show that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS in visual fidelity, semantic consistency, and detail preservation.

📝 Abstract

Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.

Problem

Research questions and friction points this paper is trying to address.

Maintaining spatial consistency in virtual try-on across poses

Ensuring semantic consistency in garment structure and texture

Preserving fine-grained details for high visual fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

APWAM aligns garments to poses spatially

SRCM enhances semantic garment representation

MPAGM integrates multimodal features for generation
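The abstract does not say how APWAM computes or applies its warp, so the following is only a generic illustration of what "aligning a garment to a pose" usually means at the pixel level in warp-based try-on pipelines: a backward warp driven by a per-pixel offset field with bilinear sampling. It is not the authors' method. The names `bilinear_sample`, `warp`, `flow_x`, and `flow_y` are hypothetical, and the single-channel list-of-lists image is a simplification chosen to keep the sketch dependency-free.

```python
from typing import List

Image = List[List[float]]  # single-channel image stored as rows of floats


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a fractional (x, y), clamping coordinates to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(garment: Image, flow_x: Image, flow_y: Image) -> Image:
    """Backward warp: output pixel (x, y) reads from (x + flow_x, y + flow_y)."""
    h, w = len(garment), len(garment[0])
    return [
        [
            bilinear_sample(garment, x + flow_x[y][x], y + flow_y[y][x])
            for x in range(w)
        ]
        for y in range(h)
    ]


# Toy example: each output pixel reads one pixel to its left,
# which shifts the garment content one pixel to the right.
garment = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
read_left = [[-1.0] * 3 for _ in range(2)]
no_offset = [[0.0] * 3 for _ in range(2)]
warped = warp(garment, read_left, no_offset)
```

In a learned pipeline the offset field would be predicted by a network conditioned on the target pose rather than written by hand as it is here; the sampling step itself stays the same.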

🔎 Similar Papers

Beyond Imperfections: A Conditional Inpainting Approach for End-to-End Artifact Removal in VTON and Pose Transfer

2024-10-05 · arXiv.org · Citations: 0
