🤖 AI Summary
This work addresses key challenges in virtual try-on, namely inaccurate bottom-garment detection, residual clothing contours, and skin reconstruction artifacts during long-to-short sleeve conversion. We propose a diffusion model optimization framework integrating multi-class clothing masks and generative skin completion. Our method introduces a pose- and skin-tone-aware skin inpainting module within a two-stage architecture: pre-inpainting followed by re-synthesis. A fine-grained clothing category masking mechanism enhances generalization across diverse garments. Compatible with mainstream diffusion models (e.g., Stable Diffusion), the approach achieves 92.5% short-sleeve synthesis accuracy on the Dress Code benchmark, surpassing Leffa by 15.4%. Visual evaluation confirms substantial improvements in texture fidelity and style consistency. The framework demonstrates strong generalizability and scalability, enabling robust adaptation to varied garment types and poses.
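The fine-grained category masking can be pictured as a union of per-category garment masks: only the categories being replaced are erased from the person image, leaving the rest clothing-agnostic. Below is a minimal illustrative sketch; the function name and the set-of-pixels mask representation are assumptions for exposition, not the paper's actual implementation:

```python
def build_agnostic_mask(category_masks, target_categories):
    """Union the pixel sets of every garment category being replaced.

    category_masks: hypothetical output of a human-parsing step, mapping
    a Dress Code-style category name ("upper_body", "lower_body",
    "dresses") to the set of (row, col) pixels that garment covers.
    target_categories: the categories the try-on will repaint; pixels of
    all other categories stay untouched.
    """
    agnostic = set()
    for name, pixels in category_masks.items():
        if name in target_categories:
            agnostic |= pixels
    return agnostic

masks = {
    "upper_body": {(0, 0), (0, 1)},
    "lower_body": {(5, 0), (5, 1)},
}
# Replacing only the top keeps the bottom garment's pixels out of the mask,
# which is how per-category masking avoids disturbing the other garments.
print(sorted(build_agnostic_mask(masks, {"upper_body"})))  # [(0, 0), (0, 1)]
```

Keeping the mask per-category is what lets the same pipeline handle tops, bottoms, and dresses without a separate model per garment type.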
📝 Abstract
With the development of deep learning, virtual try-on technology has gained significant application value in e-commerce, fashion, and entertainment. The recently proposed Leffa alleviates the texture distortion of diffusion-based models, but it remains limited by inaccurate bottom-garment detection and by residual silhouettes of the original clothing in the synthesis results. To address these problems, this study proposes CaP-VTON (Clothing agnostic Pre-inpainting Virtual Try-ON). CaP-VTON improves the naturalness and consistency of whole-body clothing synthesis by integrating multi-category masking based on Dress Code with skin inpainting based on Stable Diffusion. In particular, a skin generation module is introduced to solve the skin restoration problem that arises when long-sleeved images are converted into short-sleeved or sleeveless ones, producing high-quality restoration that accounts for body pose and skin tone. As a result, CaP-VTON achieves 92.5% short-sleeve synthesis accuracy, 15.4% higher than Leffa, and consistently reproduces the style and shape of the reference clothing in visual evaluation. The architecture remains model-agnostic, is applicable to various diffusion-based virtual try-on systems, and can contribute to applications that require high-precision virtual try-on, such as e-commerce, custom styling, and avatar creation.
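As a rough illustration of the skin-restoration problem the abstract describes: when a long sleeve is converted to a short one, the region that must be filled with generated skin is the set of pixels the original sleeve covered but the target garment does not. This is a hedged sketch under that assumption; the names and mask representation are hypothetical, not CaP-VTON's actual code:

```python
def skin_inpaint_region(original_sleeve, target_garment):
    """Pixels exposed by the conversion: covered by the long sleeve but
    left bare by the target short-sleeve garment. These are the pixels a
    skin generation module would synthesize, conditioned on body pose and
    on the person's visible skin tone for color consistency."""
    return set(original_sleeve) - set(target_garment)

long_sleeve = {(r, c) for r in range(4) for c in range(2)}   # full arm
short_sleeve = {(r, c) for r in range(2) for c in range(2)}  # upper arm only
# The forearm rows (2 and 3) need generated skin.
print(sorted(skin_inpaint_region(long_sleeve, short_sleeve)))
# → [(2, 0), (2, 1), (3, 0), (3, 1)]
```

Pre-computing this region before synthesis is what distinguishes a pre-inpainting design from naively letting the diffusion model hallucinate the exposed arm.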