Dress&Dance: Dress up and Dance as You Like It - Technical Preview

πŸ“… 2025-08-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of generating high-fidelity virtual try-on videos from a single user portrait image and a reference motion video. The proposed method produces 5-second, 24-FPS, 1152×720 videos and supports unified try-on for tops, bottoms, and one-piece garments, including simultaneous tops-and-bottoms try-on in a single pass. At its core lies CondNet, a novel conditioning network that employs cross-modal attention to jointly encode textual garment descriptions, source-person features, and reference-video motion representations, thereby improving garment registration and temporal motion fidelity. A multistage heterogeneous training strategy further leverages limited video data alongside a large-scale image dataset in a progressive manner. Built upon a video diffusion framework, the approach is trained end to end. Quantitative and qualitative evaluations demonstrate state-of-the-art performance across diverse garment categories, surpassing both open-source and commercial baselines in visual quality, garment realism, and motion consistency.
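The paper does not release code in this technical preview, so as a rough illustration of what a CondNet-style cross-modal conditioning block could look like, here is a minimal PyTorch sketch. The class name `CrossModalCondBlock`, the token dimensions, and the concatenate-then-cross-attend fusion scheme are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a CondNet-style cross-modal conditioning block.
# Assumptions (not from the paper): the class/parameter names, token
# dimensions, and the concatenate-then-cross-attend fusion scheme.
import torch
import torch.nn as nn

class CrossModalCondBlock(nn.Module):
    """Fuses text, person-image, and motion-video tokens into the
    denoiser's latent tokens via cross-attention (illustrative only)."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, latent_tok, text_tok, person_tok, motion_tok):
        # All conditioning modalities share one key/value sequence, so the
        # latent tokens can attend jointly to text, person, and motion.
        cond = self.norm_kv(torch.cat([text_tok, person_tok, motion_tok], dim=1))
        attn_out, _ = self.attn(self.norm_q(latent_tok), cond, cond)
        x = latent_tok + attn_out      # residual connection
        return x + self.mlp(x)         # feed-forward + residual

# Toy usage: batch of 2, 16 latent tokens, small conditioning sequences.
blk = CrossModalCondBlock()
out = blk(torch.randn(2, 16, 768),   # video latents
          torch.randn(2, 8, 768),    # garment text tokens
          torch.randn(2, 4, 768),    # source person tokens
          torch.randn(2, 32, 768))   # reference motion tokens
print(out.shape)                     # torch.Size([2, 16, 768])
```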

πŸ“ Abstract
We present Dress&Dance, a video diffusion framework that generates high-quality 5-second-long, 24-FPS virtual try-on videos at 1152×720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops-and-bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data with a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open-source and commercial solutions and enables a high-quality and flexible try-on experience.
Problem

Research questions and friction points this paper addresses.

Generate high-quality virtual try-on videos from user images
Unify multi-modal inputs for garment registration and motion
Support diverse garment types and simultaneous try-on in a single pass
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video diffusion framework for virtual try-on videos
CondNet unifies multi-modal inputs with attention
Multistage training combining video and image data (see the training sketch below)
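As a rough illustration of the multistage heterogeneous training idea (treating abundant images as one-frame videos so they share an objective with scarce video data, then fine-tuning on video last), here is a toy PyTorch schedule. The stage names, step counts, mixing rule, stand-in model, and loss are invented for illustration; the paper's actual recipe is not specified in this summary.

```python
# Toy sketch of a multistage heterogeneous training schedule. The stage
# names, step counts, mixing rule, model, and loss below are invented for
# illustration; the paper's actual recipe is not specified here.
import torch
import torch.nn as nn

model = nn.Linear(64, 64)                      # stand-in for the denoiser
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def fake_batch(frames: int, n: int = 4) -> torch.Tensor:
    # Images are treated as 1-frame videos, so both data sources can
    # share a single training objective.
    return torch.randn(n, frames, 64)

stages = [
    ("image-only pretrain", [1], 100),     # abundant image data first
    ("mixed image+video", [1, 8], 100),    # interleave the two sources
    ("video fine-tune", [8], 50),          # scarce video data last
]

for name, frame_options, steps in stages:
    for step in range(steps):
        frames = frame_options[step % len(frame_options)]
        x = fake_batch(frames)
        # Flatten frames into the batch so one loss covers both modalities.
        pred = model(x.reshape(-1, 64))
        loss = ((pred - x.reshape(-1, 64)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: done ({steps} steps)")
```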
πŸ”Ž Similar Papers
No similar papers found.
Authors

Jun-Kun Chen, Ph.D. Candidate of Computer Science, University of Illinois Urbana-Champaign (3D Vision, Neural Radiance Fields, Diffusion Model, Generative AI, Computer Vision)
Aayush Bansal, SpreeAI
Minh Phuoc Vo, SpreeAI
Yu-Xiong Wang, University of Illinois Urbana-Champaign