🤖 AI Summary
Video virtual try-on faces two key challenges: loss of garment detail and temporal inconsistency across frames. To address these, we propose ChronoTailor, a diffusion-based framework for fine-grained, temporally consistent virtual try-on. Our method introduces a spatio-temporal attention guidance mechanism with three components: (i) region-aware spatial guidance to preserve local garment structure; (ii) attention-driven temporal feature fusion to ensure inter-frame coherence; and (iii) a multi-scale garment-pose alignment strategy to enhance geometric consistency. We also construct and release StyleDress, a high-quality, diverse video try-on benchmark, to rigorously evaluate our approach. Extensive experiments demonstrate significant improvements in texture fidelity under dynamic motion and in cross-frame temporal continuity, outperforming state-of-the-art methods across multiple quantitative metrics. Both the code and the StyleDress dataset are publicly available.
📝 Abstract
Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain temporal continuity and to reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, offering advantages over existing public datasets; it will be made publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.
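To give a rough intuition for the "attention-driven temporal feature fusion" idea, the sketch below shows a generic form of attention-weighted fusion over per-frame feature vectors: each frame attends to all frames via similarity-derived softmax weights, pulling its feature toward temporally similar neighbours. This is a minimal illustrative sketch of the general technique, not the paper's actual mechanism; the function name, the dot-product similarity, and the temperature `tau` are all our assumptions.

```python
import numpy as np

def temporal_attention_fusion(frames: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Illustrative sketch (not the paper's implementation):
    fuse per-frame features using attention weights derived from
    inter-frame similarity, smoothing each frame toward similar ones.

    frames: (T, D) array of T per-frame feature vectors of dim D.
    """
    T, D = frames.shape
    # Scaled dot-product similarity between every pair of frames.
    logits = frames @ frames.T / (np.sqrt(D) * tau)
    # Softmax over the temporal axis -> each row is a distribution over frames.
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Each output frame is a convex combination of all input frames.
    return w @ frames
```

Because each fused frame is a convex combination of the inputs, abrupt per-frame outliers are softened, which is the kind of inter-frame coherence the fusion step targets.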