OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing virtual try-on systems lack fine-grained, commercially viable evaluation benchmarks, as conventional metrics struggle to assess texture fidelity and semantic consistency. To address this gap, this work introduces a large-scale benchmark comprising approximately 100,000 high-resolution image pairs and proposes a multi-dimensional, interpretable evaluation protocol. The protocol features a novel semantic-balanced sampling strategy based on DINOv3 clustering and dense textual descriptions from Gemini, alongside multi-scale representation metrics derived from SAM3 segmentation and morphological operations that effectively disentangle boundary misalignment from internal texture distortion. This approach achieves strong alignment with human judgment (Kendall's τ = 0.833), substantially outperforming SSIM (τ = 0.611), thereby establishing a reliable standard for evaluating virtual try-on systems.
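
The agreement figures above are rank correlations. As a rough illustration of what Kendall's τ measures, here is a minimal sketch in plain Python. It computes the simple tau-a variant (tied pairs count as neither concordant nor discordant); the summary does not say which variant the authors used, and the `human` and `metric` scores below are made-up toy values, not data from the benchmark.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumption: ties are counted as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both scores order this pair the same way
        elif sign < 0:
            discordant += 1      # the two scores disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy example (hypothetical values): human ranking vs. an automatic metric.
human = [1, 2, 3, 4]
metric = [0.2, 0.1, 0.6, 0.9]
tau = kendall_tau(human, metric)  # 5 concordant pairs, 1 discordant, 6 total
```

A value of 1 means the metric orders every pair of results the same way human raters do, and -1 means it reverses every pair.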

πŸ“ Abstract
Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
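
The idea of using morphological erosion to separate boundary errors from interior texture errors can be sketched on a toy example. The code below is only an illustration under stated assumptions, not the authors' Multi-Scale Representation Metric: it uses a hand-written binary mask in place of a SAM3 segmentation, a 3x3 square structuring element, and a plain mean absolute error on grayscale values. The names `erode`, `split_regions`, and `masked_mae` are hypothetical helpers introduced here.

```python
def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                out[r][c] = int(all(
                    0 <= r + dr < h and 0 <= c + dc < w and mask[r + dr][c + dc]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                ))
        mask = out
    return mask


def split_regions(mask, iterations=1):
    """Return (interior, boundary): the eroded mask and the band the erosion removed."""
    interior = erode(mask, iterations)
    boundary = [[int(m and not i) for m, i in zip(mrow, irow)]
                for mrow, irow in zip(mask, interior)]
    return interior, boundary


def masked_mae(a, b, region):
    """Mean absolute difference between images a and b over pixels where region is 1."""
    diffs = [abs(a[r][c] - b[r][c])
             for r in range(len(region)) for c in range(len(region[0]))
             if region[r][c]]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy 6x6 scene: the "garment" occupies rows 1-4, cols 1-4 (hypothetical mask).
mask = [[int(1 <= r <= 4 and 1 <= c <= 4) for c in range(6)] for r in range(6)]
reference = [[100] * 6 for _ in range(6)]
generated = [row[:] for row in reference]
generated[1][1] = 40  # a single error placed on the garment edge

interior, boundary = split_regions(mask)
interior_err = masked_mae(reference, generated, interior)
boundary_err = masked_mae(reference, generated, boundary)
```

In this toy case the error falls entirely in the boundary band, so `interior_err` stays at zero while `boundary_err` is nonzero. A single score over the whole mask would not show where the error lies.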
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On
evaluation benchmark
high-resolution dataset
controllable generation
fidelity assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual Try-On
High-Resolution Benchmark
Multi-Modal Evaluation
Diffusion Models
Semantic Consistency
Jin Li
Renxing Intelligence, Hangzhou, China; Hangzhou Dianzi University, Hangzhou, China
Tao Chen
Zhejiang University
Natural Language Processing
Shuai Jiang
Google
power electronics
Weijie Wang
PhD Student, Zhejiang University
Computer Vision, Efficient AI, Deep Learning
Jingwen Luo
Renxing Intelligence, Hangzhou, China
Chenhui Wu
Renxing Intelligence, Hangzhou, China