Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

📅 2026-03-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing virtual try-on and clothing-removal datasets lack fine-grained control and interactive, natural-language-guided editing. To address this limitation, this work proposes Dress-ED, the first large-scale multimodal benchmark that unifies text-guided try-on, try-off, and garment editing. Dress-ED spans three clothing categories and seven appearance- or structure-based edit types, comprising over 146,000 verified, high-quality samples. The dataset is built by a fully automated pipeline that integrates multimodal large language models (MLLMs) for garment understanding, diffusion models for image editing, and large language models (LLMs) for instruction validation. On top of this benchmark, the work introduces a multimodal diffusion framework that jointly interprets linguistic instructions and visual garment cues, establishing a strong baseline and a new paradigm for instruction-driven virtual dressing.
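
As a minimal sketch, one Dress-ED sample might be represented as the record below. The field names are hypothetical, inferred from the quadruplet description in the abstract rather than taken from the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DressEDSample:
    """One verified quadruplet plus its edit instruction (field names assumed)."""
    garment_image: str         # path to the in-shop garment image
    person_image: str          # path to the person wearing the garment
    edited_garment_image: str  # garment after the instructed edit
    edited_person_image: str   # person wearing the edited garment
    instruction: str           # natural-language edit instruction
    category: str              # one of the three garment categories
    edit_type: str             # one of seven types; the abstract names five:
                               # "color", "pattern", "material",
                               # "sleeve_length", "neckline"

# Illustrative instance; all paths and values are made up.
sample = DressEDSample(
    garment_image="shirt_001.jpg",
    person_image="model_001.jpg",
    edited_garment_image="shirt_001_navy.jpg",
    edited_person_image="model_001_navy.jpg",
    instruction="Change the shirt color to navy blue.",
    category="upper-body",
    edit_type="color",
)
```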

📝 Abstract
Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking the instruction-driven editing needed for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction describing the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.
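
The abstract's automated construction pipeline (MLLM understanding, diffusion editing, LLM verification) could be organized roughly as below. This is a hedged sketch under stated assumptions: every component and method name (`describe_garment`, `propose_edit`, `apply_edit`, `verify`) is a hypothetical stand-in, not an API from the paper.

```python
def build_dress_ed_sample(garment_img, person_img, mllm, editor, verifier):
    """Sketch of one pass through the three-stage generation pipeline.

    `mllm` stands in for the MLLM-based garment-understanding model,
    `editor` for the diffusion-based editing model, and `verifier` for
    the LLM-guided instruction validator; all are assumptions.
    """
    # Stage 1: MLLM-based garment understanding -> attributes + edit instruction.
    attributes = mllm.describe_garment(garment_img)   # e.g. color, sleeve type
    instruction = mllm.propose_edit(attributes)       # e.g. "Shorten the sleeves."

    # Stage 2: diffusion-based editing, applied consistently to both views.
    edited_garment = editor.apply_edit(garment_img, instruction)
    edited_person = editor.apply_edit(person_img, instruction)

    # Stage 3: LLM-guided verification; discard samples whose edited images
    # do not match the instruction, keeping only verified quadruplets.
    if not verifier.verify(instruction, garment_img, edited_garment,
                           person_img, edited_person):
        return None

    return {
        "garment": garment_img,
        "person": person_img,
        "edited_garment": edited_garment,
        "edited_person": edited_person,
        "instruction": instruction,
    }
```

The verification stage acts as a reject filter, which is consistent with the abstract's emphasis on "verified" quadruplets: only samples that pass the check enter the dataset.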
Problem

Research questions and friction points this paper is trying to address.

virtual try-on
virtual try-off
instruction-guided editing
fashion synthesis
garment editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-guided editing
virtual try-on
virtual try-off
multimodal diffusion
fashion synthesis
🔎 Similar Papers
No similar papers found.