OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

Date: 2026-01-30

AI Summary
Virtual try-on (VTON) research lacks fine-grained evaluation benchmarks that meet commercial standards: conventional metrics struggle to assess texture fidelity and semantic consistency, and existing datasets fall short in scale and diversity. This work introduces a benchmark of approximately 100,000 high-resolution image pairs, built with semantically balanced sampling based on DINOv3 clustering and dense textual descriptions from Gemini. It also proposes a multi-dimensional, interpretable evaluation protocol that includes a Multi-Scale Representation Metric derived from SAM3 segmentation and morphological erosion, which separates boundary misalignment from internal texture distortion. The protocol agrees more closely with human judgment (Kendall's τ = 0.833) than SSIM does (τ = 0.611).
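
To make the Kendall's τ comparison concrete, here is a minimal sketch of the tau-a rank correlation between human rankings and a metric's scores. The function name and the toy rankings are hypothetical and are not taken from the paper; the paper's exact variant (tau-a or tau-b) is not stated in the text above, so tau-a is assumed.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # The sign of the product tells whether both rankings order the pair the same way.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # Tied pairs (s == 0) count toward neither total in tau-a.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: human ranks for four systems and a metric that swaps the last two.
human_rank = [1, 2, 3, 4]
metric_rank = [1, 2, 4, 3]
tau = kendall_tau_a(human_rank, metric_rank)
```

With four items there are six pairs; one swapped pair gives five concordant and one discordant, so τ = 4/6. A value of 1 means the metric orders every pair as the human raters do, and -1 means it reverses every pair.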

Abstract
Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
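
The idea of using morphological erosion to separate boundary errors from interior texture errors can be sketched as follows. This is an illustration under stated assumptions, not the paper's Multi-Scale Representation Metric: the garment mask is assumed to be already available as a boolean array (for example from a segmentation model), the 3x3 structuring element and iteration count are arbitrary choices, and mean absolute pixel difference stands in for whatever representation distance the authors use. All function names are hypothetical.

```python
import numpy as np


def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square; pixels outside the image count as background."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        eroded = np.ones_like(out)
        # A pixel survives only if all nine pixels in its 3x3 neighbourhood are set.
        for dy in range(3):
            for dx in range(3):
                eroded &= padded[dy:dy + h, dx:dx + w]
        out = eroded
    return out


def split_regions(mask, iterations=1):
    """Split a garment mask into its eroded interior and the remaining boundary band."""
    mask = mask.astype(bool)
    interior = erode(mask, iterations)
    boundary = mask & ~interior
    return interior, boundary


def region_error(generated, reference, region):
    """Mean absolute difference restricted to a region (NaN if the region is empty)."""
    if not region.any():
        return float("nan")
    diff = np.abs(generated.astype(float) - reference.astype(float))
    return float(diff[region].mean())


# Toy example: a 5x5 garment inside a 7x7 image, with errors only on the garment edge.
mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True
interior, boundary = split_regions(mask)
reference = np.zeros((7, 7))
generated = np.where(boundary, 1.0, 0.0)
boundary_err = region_error(generated, reference, boundary)
interior_err = region_error(generated, reference, interior)
```

In this toy case all of the error falls in the boundary band and none in the interior, which is the kind of separation the abstract describes: a misaligned garment edge no longer inflates the score for internal texture. Increasing `iterations` widens the boundary band, which is one plausible way to evaluate at more than one scale.
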
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On
evaluation benchmark
high-resolution dataset
controllable generation
fidelity assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual Try-On
High-Resolution Benchmark
Multi-Modal Evaluation
Diffusion Models
Semantic Consistency