🤖 AI Summary
Existing virtual try-on methods suffer from low generation quality and temporal inconsistency in both image- and long-video-based scenarios. To address these limitations, we propose CatV2TON, the first diffusion Transformer framework to unify image and video try-on modeling. Our key contributions are: (1) a single cross-modal try-on architecture that jointly models person and garment inputs via temporal concatenation; (2) an overlapping clip-based inference strategy coupled with Adaptive Clip Normalization (AdaCN), which substantially improves temporal coherence in long videos; and (3) ViViD-S, a refined high-quality video try-on dataset, together with 3D mask smoothing and mixed image-video training. Extensive experiments demonstrate that CatV2TON consistently outperforms state-of-the-art methods on both image and video try-on tasks, achieving a superior trade-off among generation fidelity, temporal stability, and computational efficiency.
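The temporal-concatenation idea can be sketched in a few lines: the encoded garment is treated as one extra "frame" prepended to the person sequence, so image try-on (one frame) and video try-on (many frames) share the same input layout. The function and shapes below are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def temporal_concat_input(garment_latent, person_latents):
    """Hypothetical sketch: prepend the garment latent as an extra frame so a
    single DiT sequence covers both the garment and the person frames.

    garment_latent: (C, H, W)    encoded garment image
    person_latents: (T, C, H, W) encoded person frames (T=1 for image try-on)
    """
    # The garment becomes one additional leading frame in the sequence.
    return np.concatenate([garment_latent[None], person_latents], axis=0)

# The same layout serves static (T=1) and dynamic (T>1) try-on.
g = np.zeros((4, 32, 24))
v = np.zeros((8, 4, 32, 24))
x = temporal_concat_input(g, v)
print(x.shape)  # (9, 4, 32, 24)
```

With this layout, one model handles both tasks simply by varying the number of person frames in the sequence.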
📝 Abstract
Virtual try-on (VTON) technology has gained attention for its potential to transform online retail by enabling realistic clothing visualization in images and videos. However, most existing methods struggle to achieve high-quality results across both image and video try-on tasks, especially in long-video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset obtained by filtering out back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-on across diverse scenarios.
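The overlapping clip-based inference can be sketched as a sliding window: each new clip reuses the last few frames of the previous clip as guidance, and a normalization step aligns the new clip's statistics to those guidance frames. The sketch below is a minimal illustration under stated assumptions; `denoise_clip` is a stand-in stub for the diffusion model, and the mean/std matching inside `adacn` is our guess at the flavor of Adaptive Clip Normalization, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_clip(guide_frames, n_new, C=4, H=8, W=8):
    """Stub for the diffusion model: returns latents for one clip
    (guidance frames plus newly generated frames). Placeholder only."""
    return rng.normal(size=(len(guide_frames) + n_new, C, H, W))

def adacn(clip, guide, eps=1e-5):
    """Hypothetical AdaCN sketch: match the new clip's per-channel mean/std
    to the overlapping guidance frames so adjacent clips agree statistically."""
    ax = (0, 2, 3)  # over frames and spatial dims, per channel
    mu_c, sd_c = clip.mean(ax, keepdims=True), clip.std(ax, keepdims=True)
    mu_g, sd_g = guide.mean(ax, keepdims=True), guide.std(ax, keepdims=True)
    return (clip - mu_c) / (sd_c + eps) * sd_g + mu_g

def long_video_tryon(total_frames, clip_len=16, overlap=4):
    """Generate a long video clip by clip; each clip only needs clip_len
    frames in memory, keeping resource demands bounded."""
    frames = list(denoise_clip([], clip_len))      # first clip: no guidance
    while len(frames) < total_frames:
        guide = frames[-overlap:]                  # carry over trailing frames
        clip = denoise_clip(guide, clip_len - overlap)
        clip = adacn(clip, np.stack(guide))        # align clip statistics
        frames.extend(clip[overlap:])              # keep only the new frames
    return np.stack(frames[:total_frames])

video = long_video_tryon(40)
print(video.shape)  # (40, 4, 8, 8)
```

The key design point is that only `overlap` guidance frames link consecutive clips, so memory cost stays constant regardless of video length.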