🤖 AI Summary
Existing image-based virtual try-on methods neglect the guiding role of garment attributes in geometric deformation and texture synthesis, leading to distorted clothing deformation and to blurred or color-leaking textures in limb regions (e.g., the arms), with especially severe degradation when sleeve length changes. To address these issues, we propose PL-VTON, a progressive limb-aware virtual try-on framework built on three components: (1) a Multi-attribute Clothing Warping (MCW) module that uses a two-stage, attribute-guided alignment strategy to estimate pixel-level clothing displacements; (2) a Human Parsing Estimator (HPE) that divides the person into semantic regions, providing structural constraints that alleviate texture bleeding between clothing and limbs; and (3) a Limb-aware Texture Fusion (LTF) module that preserves limb texture fidelity by fusing clothing and body textures under explicit limb-aware guidance. Extensive experiments show that PL-VTON outperforms state-of-the-art methods both quantitatively and qualitatively, with noticeably sharper arm-skin texture and cleaner clothing boundaries.
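To make the three-stage structure concrete, here is a minimal sketch of how the modules could be wired together. The module interfaces and argument names are assumptions made for illustration, not the paper's actual code:

```python
import torch.nn as nn

class PLVTONPipeline(nn.Module):
    """Hypothetical wiring of the three PL-VTON stages.

    MCW, HPE, and LTF are stand-ins for the paper's modules; their
    interfaces here are assumptions made for illustration only.
    """

    def __init__(self, mcw: nn.Module, hpe: nn.Module, ltf: nn.Module):
        super().__init__()
        self.mcw = mcw  # Multi-attribute Clothing Warping
        self.hpe = hpe  # Human Parsing Estimator
        self.ltf = ltf  # Limb-aware Texture Fusion

    def forward(self, person, clothing, clothing_attrs):
        # Stage 1: estimate pixel-level displacements and warp the clothing.
        warped_clothing = self.mcw(person, clothing, clothing_attrs)
        # Stage 2: parse the person into semantic regions (arms, torso, ...).
        parsing = self.hpe(person, warped_clothing)
        # Stage 3: fuse clothing and body textures with limb-aware guidance.
        return self.ltf(person, warped_clothing, parsing)
```

The design point the summary emphasizes is the ordering: parsing sits between warping and fusion, so the fusion stage receives explicit region boundaries rather than inferring them from raw pixels.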
📝 Abstract
Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (e.g., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and thereby alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art virtual try-on methods both qualitatively and quantitatively.
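Concretely, the "pixel-level clothing displacements" can be read as a dense flow field used to resample the in-shop clothing image, and the limb-aware fusion as parsing-guided blending. Below is a minimal PyTorch sketch of both operations; the flow field and masks are assumed to come from learned modules (such as MCW and HPE) and are plain inputs here, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def warp_by_flow(clothing: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample a clothing image with per-pixel displacements.

    clothing: (N, C, H, W) image; flow: (N, 2, H, W) displacements in pixels.
    The flow itself would be predicted by a module like MCW; here it is given.
    """
    n, _, h, w = clothing.shape
    # Base sampling grid in normalized [-1, 1] coordinates, shape (N, H, W, 2).
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=clothing.device),
        torch.linspace(-1, 1, w, device=clothing.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert pixel displacements to the normalized coordinate range.
    norm_flow = torch.stack(
        (flow[:, 0] * 2 / max(w - 1, 1), flow[:, 1] * 2 / max(h - 1, 1)),
        dim=-1,
    )
    return F.grid_sample(clothing, base + norm_flow, align_corners=True)

def limb_aware_fusion(body, warped_clothing, limb_mask, clothing_mask):
    """Blend textures region-by-region using parsing-derived soft masks.

    limb_mask / clothing_mask: (N, 1, H, W) masks from a parser such as HPE.
    """
    # Composite the warped clothing over the body in clothing regions.
    fused = clothing_mask * warped_clothing + (1 - clothing_mask) * body
    # Keep body (arm-skin) texture inside limb regions to avoid color bleeding.
    return limb_mask * body + (1 - limb_mask) * fused
```

Note that the actual LTF module synthesizes new limb detail rather than simply copying body pixels; the mask-based blend above only illustrates where the parsing constraint enters the fusion step.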