🤖 AI Summary
Existing virtual try-on evaluation suffers from three key limitations: misalignment between quantitative metrics and human perception, overreliance on indoor-scene test sets, and the absence of a systematic, real-world benchmark. To address these, we propose VTBench, a comprehensive virtual try-on benchmark designed for real-world scenarios. Our approach introduces a hierarchical, decoupled evaluation framework that quantifies performance across five dimensions: image fidelity, texture preservation, background consistency under complex scenes, cross-category size adaptation, and hand-occlusion handling. Crucially, we provide human preference annotations for every test set to bridge objective metrics with subjective perception. Methodologically, we construct a multi-granularity real-world test set and conduct cross-scenario analysis, revealing a significant performance gap between indoor and real-world settings. The full benchmark, including data, evaluation protocols, generated outputs, and human annotations, is publicly released, substantially enhancing evaluation authenticity, interpretability, and practical guidance.
📝 Abstract
While virtual try-on has achieved significant progress, evaluating these models in real-world scenarios remains challenging. A comprehensive benchmark is essential for three key reasons: (1) current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) most existing test sets are limited to indoor scenarios and lack the complexity needed for real-world evaluation; and (3) an ideal benchmark should guide future advances in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into disentangled evaluation dimensions, each equipped with tailored test sets and evaluation criteria. VTBench offers three key advantages: (1) Multi-Dimensional Evaluation Framework: the benchmark covers five critical dimensions of virtual try-on generation, namely overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling; granular metrics on the corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To advance virtual try-on toward challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
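The human-alignment criterion lends itself to a concrete illustration. The sketch below is not VTBench's published protocol; the dimension names, the data layout, and the choice of Spearman rank correlation are all assumptions. It shows one plausible way to score how well an automatic metric tracks human preference annotations, per evaluation dimension:

```python
# Hypothetical sketch: per-dimension agreement between an automatic
# metric and human preference ratings, via Spearman rank correlation.
# Dimension names and data layout are illustrative, not VTBench's API.
from scipy.stats import spearmanr

# The five dimensions named in the abstract (identifiers are assumed).
DIMENSIONS = [
    "image_quality",
    "texture_preservation",
    "background_consistency",
    "size_adaptability",
    "hand_occlusion",
]

def human_alignment(metric_scores, human_scores):
    """Spearman correlation between a metric and human ratings.

    Both arguments map a dimension name to a list of per-image scores
    over the same set of generated try-on results (hypothetical layout).
    """
    return {
        # [0] extracts the correlation coefficient from the result.
        dim: spearmanr(metric_scores[dim], human_scores[dim])[0]
        for dim in DIMENSIONS
    }

# Toy example with made-up scores for four images per dimension.
metric = {d: [0.80, 0.55, 0.90, 0.40] for d in DIMENSIONS}
human = {d: [4, 2, 5, 1] for d in DIMENSIONS}
print(human_alignment(metric, human))
```

Under this kind of protocol, a higher correlation on a given dimension would indicate that the metric better agrees with human judgments there, which is the alignment property the benchmark's annotations are meant to let users verify.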