OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

📅 2026-01-30

🤖 AI Summary

Existing virtual try-on systems lack fine-grained, commercially viable evaluation benchmarks, and conventional metrics struggle to assess texture fidelity and semantic consistency. To address this gap, this work introduces a large-scale benchmark of approximately 100,000 high-resolution image pairs together with a multi-dimensional, interpretable evaluation protocol. The dataset is built with semantically balanced sampling based on DINOv3 clustering and dense captions from Gemini. The protocol adds a multi-scale representation metric that uses SAM3 segmentation and morphological erosion to disentangle boundary misalignment from internal texture distortion. The approach agrees strongly with human judgment (Kendall's τ = 0.833), compared with τ = 0.611 for SSIM.
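For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Kendall's τ (the tau-a variant, which ignores ties). The function name `kendall_tau` and the two score lists are hypothetical and chosen for illustration only; they are not data from the paper, and the paper may use a different variant of the statistic.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of four try-on results (illustrative values only).
human_rank = [1, 2, 3, 4]
metric_rank = [1, 2, 4, 3]  # the metric swaps the last two items
tau = kendall_tau(human_rank, metric_rank)
```

A value of 1 means the metric orders every pair of results the same way the human raters do, and each swapped pair lowers the score.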

📝 Abstract

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
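The idea of using morphological erosion to separate boundary errors from interior texture errors can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: the garment mask is hand-written rather than produced by SAM3, the structuring element is assumed to be a 3×3 square, and the per-region mean absolute error is a stand-in for the paper's representation-based comparison. All function names are hypothetical.

```python
def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                if all(
                    0 <= r + dr < h and 0 <= c + dc < w and mask[r + dr][c + dc]
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                ):
                    out[r][c] = 1
        mask = out
    return mask


def split_mask(mask, iterations=1):
    """Return (interior, boundary): the eroded mask and the band removed by erosion."""
    interior = erode(mask, iterations)
    boundary = [
        [int(bool(m) and not i) for m, i in zip(mask_row, inner_row)]
        for mask_row, inner_row in zip(mask, interior)
    ]
    return interior, boundary


def region_mean_abs_error(img_a, img_b, region):
    """Mean absolute pixel difference over the pixels where region == 1."""
    diffs = [
        abs(a - b)
        for row_a, row_b, row_r in zip(img_a, img_b, region)
        for a, b, keep in zip(row_a, row_b, row_r)
        if keep
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical 5x5 garment mask: a 3x3 block of ones in the centre.
garment = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
interior, boundary = split_mask(garment)

# Hypothetical grayscale images: the generated image differs at one edge pixel only.
reference = [[0] * 5 for _ in range(5)]
generated = [[0] * 5 for _ in range(5)]
generated[1][1] = 10

interior_err = region_mean_abs_error(reference, generated, interior)
boundary_err = region_mean_abs_error(reference, generated, boundary)
```

In this toy case the whole error falls in the boundary band, so the interior score stays at zero. That is the kind of separation the abstract describes between alignment errors and texture artifacts.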

Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On

evaluation benchmark

high-resolution dataset

controllable generation

fidelity assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual Try-On

High-Resolution Benchmark

Multi-Modal Evaluation