🤖 AI Summary
This work addresses the challenge of generating long-duration virtual try-on videos from a single input image. It proposes a segmented autoregressive diffusion framework that jointly ensures local temporal smoothness and global temporal consistency. First, the long video is modeled as an autoregressive sequence of segments, where each segment is conditioned on the preceding prefix video to keep the transition between adjacent segments smooth. Second, a 360-degree anchor video, which captures the person's whole-body appearance, serves as a global prior that enforces consistency across segments. Third, the framework integrates full-body geometric representations with temporal-aware diffusion modeling. To our knowledge, this is the first method enabling minute-long, high-fidelity virtual try-on video generation that preserves fine-grained texture details, physically plausible deformations, and seamless cross-segment consistency, even under complex human motions. The approach overcomes a key technical bottleneck in long-video synthesis, namely the joint modeling of local dynamics and global temporal structure, and establishes a novel paradigm for practical virtual try-on systems.
📝 Abstract
We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. VFR models long video generation as an auto-regressive, segment-by-segment process, eliminating the need for resource-intensive generation and lengthy video data while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, VFR ensures smoothness through a prefix video condition and enforces consistency with an anchor video, a 360-degree video that comprehensively captures the human's whole-body appearance. VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.
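The segment-by-segment process described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `generate_long_video`, `toy_segment_model`, `first_prefix`, and `prefix_len` are hypothetical, frames are stand-in lists of floats, and the toy model replaces the diffusion model with simple interpolation so that only the control flow (prefix condition plus a fixed anchor condition) is shown. How the very first prefix is obtained is not stated in the abstract, so it is passed in as an input here.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an image; a real frame would be an array

# Hypothetical signature: (prefix frames, anchor frames, frames to produce) -> new frames
SegmentModel = Callable[[Sequence[Frame], Sequence[Frame], int], List[Frame]]


def generate_long_video(
    generate_segment: SegmentModel,
    anchor_video: Sequence[Frame],
    first_prefix: Sequence[Frame],
    num_segments: int,
    segment_len: int,
    prefix_len: int,
) -> List[Frame]:
    """Autoregressive loop: each segment sees the most recent frames
    (local smoothness) and the same anchor video (global consistency)."""
    video: List[Frame] = list(first_prefix)
    for _ in range(num_segments):
        prefix = video[-prefix_len:]  # last frames generated so far
        segment = generate_segment(prefix, anchor_video, segment_len)
        video.extend(segment)
    return video


def toy_segment_model(
    prefix: Sequence[Frame], anchor: Sequence[Frame], num_frames: int
) -> List[Frame]:
    """Assumed placeholder for the diffusion model: continues from the last
    prefix frame and moves halfway toward the anchor's mean value each step."""
    current = prefix[-1][0]
    target = sum(frame[0] for frame in anchor) / len(anchor)
    frames: List[Frame] = []
    for _ in range(num_frames):
        current = current + 0.5 * (target - current)
        frames.append([current])
    return frames


if __name__ == "__main__":
    video = generate_long_video(
        toy_segment_model,
        anchor_video=[[1.0], [1.0]],
        first_prefix=[[0.0]],
        num_segments=3,
        segment_len=4,
        prefix_len=2,
    )
    print(len(video), video[-1])
```

Because the anchor is passed unchanged to every call while the prefix slides forward, the number of segments can grow without the per-call cost growing, which is the property the abstract attributes to the segment-wise formulation.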