🤖 AI Summary
To address three key challenges in e-commerce human-product demonstration video generation (identity loss, spatial relationship distortion, and unnatural motion), this paper proposes a diffusion-based framework built on the Diffusion Transformer (DiT). The method introduces: (1) a dual-reference injection mechanism for both the human and the product, coupled with a masked cross-attention module, to jointly preserve identity features; (2) joint motion guidance that integrates 3D body mesh estimation with product bounding-box trajectories to enforce geometric alignment between hand gestures and the product; and (3) structured textual encoding combined with hybrid data augmentation. Experiments demonstrate significant improvements over state-of-the-art methods in identity fidelity, product detail reconstruction (e.g., logos and textures), and motion naturalness. The approach also supports 3D-consistent modeling under small-angle rotations, enabling robust multi-view synthesis for e-commerce applications.
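The paper does not spell out the masked cross-attention formulation, but the general idea (latent video tokens attend only to the reference tokens relevant to their region, e.g. product-region tokens attend to the product reference and human-region tokens to the human reference) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the token split and mask layout are assumptions for the example.

```python
import numpy as np

def masked_cross_attention(q, k, v, mask):
    """Single-head cross-attention with a boolean routing mask.

    q: (Nq, d) latent video tokens (queries)
    k, v: (Nk, d) concatenated reference tokens (human + product)
    mask: (Nq, Nk) bool, True where a query may attend to a key
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy setup (sizes are illustrative): 4 latent tokens, 6 reference tokens,
# where keys 0-2 come from the human reference and keys 3-5 from the product.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
mask = np.zeros((4, 6), dtype=bool)
mask[:2, :3] = True   # human-region queries -> human reference only
mask[2:, 3:] = True   # product-region queries -> product reference only
out = masked_cross_attention(q, k, v, mask)
```

Because masked scores are driven to effectively zero weight, perturbing the product-reference values leaves the human-region outputs untouched, which is exactly the identity-isolation property the dual-reference design targets.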
📝 Abstract
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
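The abstract mentions product bounding boxes as part of the motion guidance. One common way such guidance is fed to a diffusion model is to rasterize the per-frame boxes into a spatiotemporal mask that is concatenated with the latent input; the sketch below illustrates that generic step only, since the paper's exact conditioning pathway is not described here. Coordinates and shapes are assumed for the example.

```python
import numpy as np

def bbox_trajectory_to_control(boxes, height, width):
    """Rasterize a per-frame product bounding-box trajectory into a
    binary control video of shape (T, H, W).

    boxes: list of (x0, y0, x1, y1) pixel boxes, one per frame,
           with x1 > x0 and y1 > y0 (exclusive upper bounds).
    """
    frames = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        frames[t, y0:y1, x0:x1] = 1.0  # mark the product region
    return frames

# Two frames of a product moving toward the top-left corner.
control = bbox_trajectory_to_control([(1, 1, 3, 4), (0, 0, 2, 2)], 5, 5)
```

A dense signal like this gives the generator an explicit per-frame target for where the product (and hence the demonstrating hand) should be, which is one plausible way to realize the geometric alignment the abstract describes.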