Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating long-duration virtual try-on videos from a single input image. Methodologically, it proposes a segmented autoregressive diffusion framework that jointly ensures local temporal smoothness and global temporal consistency. First, the long video is modeled as an autoregressive sequence of segments, where each segment is conditioned on the preceding prefix video to guarantee inter-frame local continuity. Second, a 360-degree anchored video serves as a global temporal prior, explicitly enforcing cross-segment motion coherence. Third, the framework integrates full-body geometric representations with temporal-aware diffusion modeling. To our knowledge, this is the first method enabling minute-long, high-fidelity virtual try-on video generation—preserving fine-grained texture details, physically plausible deformations, and seamless cross-segment consistency—even under complex human motions. The approach overcomes a key technical bottleneck in long-video synthesis: the joint modeling of local dynamics and global temporal structure—establishing a novel paradigm for practical virtual try-on systems.

📝 Abstract
We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video training data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video -- a 360-degree video that comprehensively captures the human's whole-body appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.
Problem

Research questions and friction points this paper is trying to address.

Generating arbitrarily long virtual try-on videos
Ensuring local smoothness between video segments
Maintaining global temporal consistency across segments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-regressive segment-by-segment generation process
Prefix video condition ensures local smoothness
Anchor video enforces global temporal consistency
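The generation loop described above can be sketched in a few lines. This is a hedged toy illustration, not the paper's implementation: the diffusion sampler is replaced by a random stand-in, and all function and parameter names (`generate_segment`, `prefix_len`, `seg_len`) are hypothetical. It only shows the control flow: each segment is conditioned on the tail of the previous one (local smoothness) and on a fixed 360-degree anchor video (global consistency).

```python
import numpy as np

def generate_segment(prefix, anchor, seg_len, frame_shape, rng):
    # Stand-in for the conditional diffusion sampler. In the real VFR,
    # sampling is conditioned on `prefix` (tail frames of the previous
    # segment) and `anchor` (the 360-degree whole-body video); here we
    # just emit random frames to illustrate the loop structure.
    _ = (prefix, anchor)  # conditioning inputs (unused in this toy stand-in)
    return rng.standard_normal((seg_len,) + frame_shape)

def generate_long_video(anchor, num_segments, seg_len=16, prefix_len=4,
                        frame_shape=(8, 8, 3), seed=0):
    """Auto-regressive, segment-by-segment long-video generation."""
    rng = np.random.default_rng(seed)
    segments = []
    prefix = None  # the first segment has no prefix condition
    for _ in range(num_segments):
        segment = generate_segment(prefix, anchor, seg_len, frame_shape, rng)
        segments.append(segment)
        # Condition the next segment on the tail of this one for
        # inter-segment smoothness.
        prefix = segment[-prefix_len:]
    return np.concatenate(segments, axis=0)

anchor_video = np.zeros((32, 8, 8, 3))  # placeholder 360-degree anchor
video = generate_long_video(anchor_video, num_segments=5)
print(video.shape)  # (80, 8, 8, 3): 5 segments of 16 frames each
```

Because each iteration only needs the previous segment's tail and the fixed anchor, the loop can run for any number of segments, which is what makes arbitrary-length generation cheap relative to generating one long video in a single pass.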
Jun-Kun Chen
Ph.D. Candidate of Computer Science, University of Illinois Urbana-Champaign
3D Vision · Neural Radiance Fields · Diffusion Model · Generative AI · Computer Vision
Aayush Bansal
SpreeAI
Minh Phuoc Vo
SpreeAI
Yu-Xiong Wang
University of Illinois Urbana-Champaign