🤖 AI Summary
Virtual try-on faces three core challenges across poses: geometric distortion, semantic inconsistency, and fine-detail loss. To address these, we propose HF-VTON, a synergistic framework comprising three modules: APWAM (Appearance-Preserving Warp Alignment Module), which performs pose-aware garment alignment; SRCM (Semantic Representation and Comprehension Module), which learns fine-grained semantic representations; and MPAGM (Multimodal Prior-Guided Appearance Generation Module), which guides generation with multimodal priors. Together they enable the first joint modeling of geometric deformation and semantic structure consistency. We further introduce SAMP-VTONS, the first benchmark dataset explicitly designed for multi-pose evaluation, and integrate pretrained vision-language models with pose-aware deformation modeling. Extensive experiments demonstrate state-of-the-art performance on both VITON-HD and SAMP-VTONS, with improvements in image fidelity, garment structural and textural consistency, and local detail recovery.
📝 Abstract
Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.