Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Virtual Try-On (VTON) suffers from poor pose fidelity, reliance on auxiliary modules (e.g., dedicated encoders or control networks), and difficulty in precise pose guidance. Method: This paper proposes a lightweight, end-to-end pose-fusion approach that eliminates extra modules by performing channel-wise spatial concatenation of pose maps and garment images for direct pose guidance. It introduces pose map representation learning and jointly trains with fine-grained segmentation masks and bounding-box masks to balance pose consistency and garment deformation flexibility. Contribution/Results: Experiments demonstrate significant improvements in pose preservation accuracy and visual realism of synthesized images. The method achieves high-quality try-on results across diverse and complex human poses, establishing a parameter-free, concise, and effective pose-control paradigm for end-to-end VTON.

📝 Abstract
As online shopping continues to grow, the demand for Virtual Try-On (VTON) technology has surged, allowing customers to visualize products on themselves by overlaying product images onto their own photos. An essential yet challenging condition for effective VTON is pose control, which ensures accurate alignment of products with the user's body while supporting diverse orientations for a more immersive experience. However, incorporating pose conditions into VTON models presents several challenges, including selecting the optimal pose representation, integrating poses without additional parameters, and balancing pose preservation with flexible pose control. In this work, we build upon a baseline VTON model that concatenates the reference image condition without an external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data, comparing performance between pose maps and skeletons, without adding any additional parameters or modules to the baseline model. Our experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism. Additionally, we introduce a mixed-mask training strategy using fine-grained and bounding-box masks, allowing the model to support flexible product integration across varied poses and conditions.
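The core idea above — encoder-free pose conditioning by stacking the pose map alongside the other spatial conditions — can be sketched as a single channel-wise concatenation. This is a minimal illustration, not the authors' implementation: the channel counts, the exact set of conditions, and their ordering are assumptions for the example.

```python
# Minimal sketch of encoder-free pose conditioning via channel-wise
# concatenation. Shapes and channel counts are illustrative assumptions;
# the paper's exact latent layout is not specified here.
import numpy as np

def build_conditioned_input(noisy_latent, masked_person, mask, garment, pose_map):
    """Concatenate all conditions along the channel axis (axis 0 for CHW).

    No encoder or control network is involved: the pose map is treated as
    just another spatial condition, so the only change to a baseline model
    would be widening its first convolution's input channels.
    """
    return np.concatenate(
        [noisy_latent, masked_person, mask, garment, pose_map], axis=0
    )

# Toy example with 64x64 spatial maps (channel counts are hypothetical).
h = w = 64
noisy_latent  = np.zeros((4, h, w))   # diffusion latent being denoised
masked_person = np.zeros((4, h, w))   # person image with try-on region masked
mask          = np.zeros((1, h, w))   # inpainting mask
garment       = np.zeros((4, h, w))   # reference garment image
pose_map      = np.zeros((3, h, w))   # dense pose map (e.g., RGB-coded)

x = build_conditioned_input(noisy_latent, masked_person, mask, garment, pose_map)
print(x.shape)  # (16, 64, 64)
```

Because the conditions share the spatial grid, pose guidance comes "for free" spatially: the denoiser sees the pose at every pixel it generates, with zero added parameters beyond the widened input layer.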
Problem

Research questions and friction points this paper is trying to address.

Incorporating pose control into Virtual Try-On without adding parameters
Selecting optimal pose representation for accurate product alignment
Balancing pose preservation with flexible pose control in VTON
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose control via spatial concatenation without parameters
Pose maps outperform skeletons for alignment realism
Mixed-mask training enables flexible product integration
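The mixed-mask strategy listed above can be sketched as randomly choosing, per training sample, between the fine-grained segmentation mask and its coarse bounding-box version. The mixing ratio `p_box` and the helper names are assumptions for illustration; the paper does not specify them here.

```python
# Sketch of mixed-mask training mask selection. The fine-grained mask
# constrains garment shape tightly (pose consistency); the bounding-box
# mask leaves room for the garment to deform freely. The 50/50 mixing
# ratio is a hypothetical default, not taken from the paper.
import random

def bounding_box_mask(seg_mask):
    """Coarsen a binary (0/1) segmentation mask to its bounding box."""
    rows = [i for i, row in enumerate(seg_mask) if any(row)]
    cols = [j for j in range(len(seg_mask[0]))
            if any(row[j] for row in seg_mask)]
    box = [[0] * len(seg_mask[0]) for _ in seg_mask]
    if rows and cols:
        for i in range(rows[0], rows[-1] + 1):
            for j in range(cols[0], cols[-1] + 1):
                box[i][j] = 1
    return box

def sample_training_mask(seg_mask, p_box=0.5, rng=random):
    """With probability p_box use the coarse box mask, else the fine mask."""
    return bounding_box_mask(seg_mask) if rng.random() < p_box else seg_mask

seg = [[0, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 0]]
print(bounding_box_mask(seg))
# → [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
```

Training on both mask granularities is what lets one model balance strict pose preservation (fine mask) against flexible garment deformation (box mask) at inference time.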