TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video virtual try-on methods struggle to produce high-quality, temporally consistent results in unconstrained real-world scenes, both because large-scale triplet data is scarce and because they rely on precise garment masks. This work proposes TripVVT, a Diffusion Transformer-based framework that uses stable, coarse human masks instead of fine-grained garment masks and is trained with video-level cross-garment supervision. The authors also present TripVVT-10K, which they describe as the largest in-the-wild video triplet dataset to date, and TripVVT-Bench, a 100-case evaluation benchmark. In their experiments, TripVVT outperforms both academic and commercial systems under complex real-world conditions in video quality, try-on fidelity, and temporal coherence. The dataset and benchmark are publicly released.
📝 Abstract
Due to the scarcity of large-scale in-the-wild triplet data and the improper use of masks, the performance of video virtual try-on models remains limited. In this paper, we first introduce **TripVVT-10K**, the largest and most diverse in-the-wild triplet dataset to date, providing explicit video-level cross-garment supervision that existing video datasets lack. Built upon this resource, we develop **TripVVT**, a Diffusion Transformer-based framework that replaces fragile garment masks with a simple, stable human-mask prior, enabling reliable background preservation while remaining robust to real-world motion, occlusion, and cluttered scenes. To support comprehensive evaluation, we further establish **TripVVT-Bench**, a 100-case benchmark covering diverse garments, complex environments, and multi-person scenarios, with metrics spanning video quality, try-on fidelity, background consistency, and temporal coherence. Compared to state-of-the-art academic and commercial systems, TripVVT achieves superior video quality and garment fidelity while markedly improving generalization to challenging in-the-wild videos. We publicly release the dataset and benchmark, which we believe provide a solid foundation for advancing controllable, realistic, and temporally stable video virtual try-on.
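To make the background-preservation idea concrete, here is a minimal sketch of how a coarse human mask can protect the background of a frame. It is an illustration only, not the authors' pipeline: the abstract does not say how the coarse mask is built or how it enters the Diffusion Transformer. The padded bounding box, the pixel-wise compositing step, and the names `coarse_box_mask` and `composite` are all assumptions made for this example.

```python
from typing import List

Mask = List[List[int]]      # 1 = person region, 0 = background
Frame = List[List[float]]   # single-channel pixel values, for brevity


def coarse_box_mask(fine_mask: Mask, pad: int = 1) -> Mask:
    """Expand a per-pixel person mask into a padded bounding-box mask.

    Assumption: a padded box stands in for the paper's "coarse" mask.
    """
    h, w = len(fine_mask), len(fine_mask[0])
    rows = [r for r in range(h) if any(fine_mask[r])]
    cols = [c for c in range(w) if any(fine_mask[r][c] for r in range(h))]
    if not rows:  # no person found: leave the whole frame untouched
        return [[0] * w for _ in range(h)]
    top, bottom = max(rows[0] - pad, 0), min(rows[-1] + pad, h - 1)
    left, right = max(cols[0] - pad, 0), min(cols[-1] + pad, w - 1)
    return [
        [1 if top <= r <= bottom and left <= c <= right else 0 for c in range(w)]
        for r in range(h)
    ]


def composite(original: Frame, generated: Frame, mask: Mask) -> Frame:
    """Keep original pixels outside the mask and generated pixels inside it."""
    return [
        [g if m else o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]
```

Applied per frame, a box like this depends only on where the person is, not on the exact garment outline, which is one plausible reason a coarse mask could be more stable under motion and occlusion than a garment mask. Everything outside the box is copied from the input video unchanged.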
Problem

Research questions and friction points this paper is trying to address.

video virtual try-on
in-the-wild data
triplet dataset
garment mask
temporal coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

video virtual try-on
triplet dataset
diffusion transformer
coarse human-mask prior
in-the-wild benchmark