Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism

📅 2024-12-13

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Video virtual try-on faces two key challenges: difficulty preserving fine garment details and temporal inconsistency of human body parts during rapid motion, compounded by high computational overhead in existing methods. This paper proposes the first lightweight and efficient framework based on the Diffusion Transformer (DiT), where the DiT backbone jointly serves as both the garment encoder and the generative core. We introduce a limb-aware dynamic attention mechanism that explicitly models joint motion trajectories to ensure temporal consistency of limbs across frames. Additionally, a dynamic feature fusion module is designed to concentrate denoising efforts on critical regions, significantly reducing computational redundancy. Experiments demonstrate that our method generates high-fidelity, texture-preserving, and temporally coherent try-on videos under complex poses, while achieving superior inference efficiency compared to state-of-the-art approaches relying on separate garment encoders.

Technology Category

Application Category

📝 Abstract

Video try-on stands as a promising area for its tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, while performing poorly with complex movements. To better preserve clothing details, those approaches are armed with an additional garment encoder, resulting in higher computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder's capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on Diffusion Transformer(DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach by utilizing the DiT backbone itself as the garment encoder and employing a dynamic feature fusion module to store and integrate garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that enforces the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.

Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in video try-on systems

Ensuring temporal consistency during complex human movements

Preserving clothing details without additional garment encoder

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformer as garment encoder

Dynamic feature fusion module integrates features

Limb-aware attention ensures temporal consistency

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs