🤖 AI Summary
Existing virtual try-on methods rely on manually designed binary masks to control the generation layout. This limits fine-grained style editing (e.g., rolled sleeves, layering) and hurts generalizability, because mask construction is domain-specific.
Method: We propose a language-driven interactive virtual try-on framework that eliminates manual masks. It combines vision-language models (VLMs) with segmentation models for semantic-aware automatic mask generation, augmented by multi-round image-guided inpainting to support fine-grained style control over single or multiple garments. Free-text instructions are parsed into structured editing intents, enabling dynamic generation of semantically consistent masks and refined outputs.
Contribution/Results: Our method achieves state-of-the-art performance in complex outfit scenarios, enabling the first mask-free controllable style transfer in virtual try-on. It exhibits strong model compatibility and scalability, eliminating reliance on mask priors while supporting diverse, natural language-guided edits.
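The summary above describes parsing free-text instructions into structured editing intents that drive mask generation. A minimal sketch of that idea follows; the `EditIntent` schema, the keyword lookup, and all region names are hypothetical illustrations (the paper's actual VLM prompt and intent format are not given here), standing in for a real VLM parser.

```python
from dataclasses import dataclass, field

# Hypothetical structured-intent schema; the paper's real format may differ.
@dataclass
class EditIntent:
    garment: str                                  # e.g. "long-sleeve shirt"
    style: str                                    # e.g. "sleeves rolled up"
    regions: list = field(default_factory=list)   # body regions the mask must cover

# Toy keyword table standing in for a VLM. Note that "sleeves rolled up"
# maps to a region set that excludes the forearms, so skin can be generated there.
STYLE_REGIONS = {
    "sleeves rolled up": ["torso", "upper_arms"],
    "tucked in": ["torso"],
}
DEFAULT_REGIONS = ["torso", "arms"]

def parse_instruction(text: str, garment: str) -> EditIntent:
    """Map a free-text styling instruction to a structured intent.

    A real system would prompt a VLM; this keyword match only
    illustrates the intent schema it would fill in."""
    lowered = text.lower()
    for style, regions in STYLE_REGIONS.items():
        if style in lowered:
            return EditIntent(garment, style, list(regions))
    return EditIntent(garment, "default", list(DEFAULT_REGIONS))

intent = parse_instruction("Try it on with sleeves rolled up", "long-sleeve shirt")
print(intent.regions)  # regions a segmentation model would union into the binary mask
```

A segmentation model would then convert the listed regions into a pixel-level binary mask, which is the part this framework automates away from the end user.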
📝 Abstract
We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable approach formulates virtual try-on as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult: it requires background knowledge, may be model-dependent, and is in some cases impossible with a masking-based approach (e.g., trying on a long-sleeve shirt with "sleeves rolled up" styling on a person wearing a long-sleeve shirt with sleeves down, where the mask necessarily covers the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the need for a precisely drawn mask and by automating multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models and achieves state-of-the-art results with styling control.
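The abstract's rolled-sleeves example is exactly the case that needs multiple automated rounds: first inpaint the garment, then regenerate exposed skin where the original sleeve was. A minimal driver loop for that orchestration is sketched below; both model calls are stubs with invented names (`segment_mask`, `inpaint_tryon`), not the paper's API, and serve only to show how one round's output feeds the next round's mask generation.

```python
# Hypothetical multi-round try-on driver. Each round builds a mask from the
# *current* image and one editing step, then calls an inpainting model.
# The image is modeled as a list of applied edits so the sketch stays runnable.

def segment_mask(image, regions):
    # Stub: a segmentation model would return a pixel-level binary mask
    # covering the requested body regions in the current image.
    return {"covers": tuple(sorted(regions))}

def inpaint_tryon(image, condition, mask):
    # Stub: an image-conditioned inpainting model would repaint the masked
    # area with the garment (or skin) given as the condition image.
    return image + [f"{condition} over {mask['covers']}"]

def run_rounds(person_image, steps):
    """steps: list of (condition, regions) pairs, one per round.

    Rolled sleeves need two rounds: paint the long-sleeve shirt over
    torso and arms, then regenerate bare skin on the forearms."""
    image = person_image
    for condition, regions in steps:
        mask = segment_mask(image, regions)   # mask derived from current state
        image = inpaint_tryon(image, condition, mask)
    return image

result = run_rounds([], [
    ("long-sleeve shirt", ["torso", "arms"]),
    ("skin", ["forearms"]),
])
print(len(result))  # one edit applied per round
```

The key design point the sketch illustrates is that the mask is recomputed each round from the intermediate result, which is what lets the system reach layouts a single fixed mask cannot express.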