iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video virtual try-on methods do not model active interaction between a person and their garment, so their outputs are limited to static garment showcases that users cannot control. This work introduces and formalizes the task of interactive video virtual try-on and proposes iTryOn, a framework built on a large-scale video diffusion Transformer. iTryOn uses a multi-level interaction injection mechanism: at the spatial level, a garment-agnostic 3D hand prior guides hand-garment contact; at the semantic level, global captions supply overall context and time-stamped action captions describe localized interactions, synchronized with the video through an Action-aware Rotational Position Embedding (A-RoPE). Together these are meant to capture human-garment interactions that are sparse and brief within a video. The authors report state-of-the-art performance on both conventional and interactive try-on benchmarks, with more realistic and more controllable generated videos.

📝 Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
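The abstract does not say how A-RoPE is computed, so the following is only a rough illustration of the general idea of synchronizing caption tokens with video frames through rotary positions. It uses the standard rotary position embedding and assumes, without support from the paper, that the tokens of a time-stamped action caption are given temporal positions inside the frame span they describe. The names `apply_rope` and `caption_positions` are hypothetical and are not taken from iTryOn.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: one rotation frequency per pair of channels."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # shape (dim/2,)
    return np.outer(positions, inv_freq)               # shape (n, dim/2)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive channel pairs of x (n, dim) by position-dependent angles."""
    ang = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def caption_positions(start_frame, end_frame, n_tokens):
    """ASSUMPTION (not from the paper): spread a caption's tokens
    evenly over the frame span named by its timestamp."""
    return np.linspace(start_frame, end_frame, n_tokens)

# Toy example: one video-frame query and one action-caption key.
rng = np.random.default_rng(0)
frame_q = rng.standard_normal((1, 8))
caption_k = rng.standard_normal((1, 8))

# A caption token placed at the same temporal position as a frame token
# has zero relative offset, so the attention logit is left unrotated.
aligned = apply_rope(frame_q, [7.0]) @ apply_rope(caption_k, [7.0]).T
plain = frame_q @ caption_k.T
```

Because rotary embeddings make the attention logit depend only on the difference between positions, a caption token placed at the frame index of its action sees that frame with no positional rotation, while frames further away in time are rotated progressively. Whether iTryOn uses this particular assignment is not stated in the abstract.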

Problem

Research questions and friction points this paper is trying to address.

Interactive Video Virtual Try-On

human-garment interaction

semantic ambiguity

garment deformation

temporal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive Video Virtual Try-On

Spatial-Semantic Guidance

3D Hand Prior

Action-aware Rotational Position Embedding

Video Diffusion Transformer
