ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Virtual try-on for videos faces two key challenges: loss of garment detail and temporal inconsistency across frames. To address these, we propose a diffusion-based fine-grained temporally consistent virtual try-on framework. Our method introduces a novel spatiotemporal attention guidance mechanism: (i) region-aware spatial guidance to preserve local garment structure; (ii) attention-driven temporal feature fusion to ensure inter-frame coherence; and (iii) a multi-scale garment-pose alignment strategy to enhance geometric consistency. We construct and release StyleDress—a high-quality, diverse video-based try-on benchmark—to rigorously evaluate our approach. Extensive experiments demonstrate significant improvements in texture fidelity under dynamic motion and cross-frame temporal continuity, outperforming state-of-the-art methods across multiple quantitative metrics. Both the code and the StyleDress dataset are publicly available.

Technology Category

Application Category

📝 Abstract
Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain continuity and reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates a garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, offering advantages over existing public datasets, and will be publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.
Problem

Research questions and friction points this paper is trying to address.

Maintain temporal consistency in video virtual try-on
Preserve fine-grained garment details accurately
Mitigate artifacts from video dynamics effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based framework for video try-on
Spatio-temporal attention for garment details
Multi-scale features for temporal continuity
🔎 Similar Papers
No similar papers found.
Wenzhang Sun
Wenzhang Sun
Beijing Institute of Technology
3D ReconstructionAIGC
M
Ming Li
Communication University of China
Yun Zheng
Yun Zheng
Alibaba
Computer VisionMultimodal Modeling
F
Fanyao Li
Communication University of China
Z
Zhulin Tao
Communication University of China
Donglin Di
Donglin Di
Li Auto Inc.
Generative ModelsEmbodied AIMedical ImageMultimedia
H
Hao Li
Li Auto
W
Wei Chen
Li Auto
X
Xianglin Huang
Communication University of China