DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in e-commerce human-product demonstration video generation—identity loss, spatial-relationship distortion, and unnatural motion—this paper proposes a novel diffusion-based framework built upon the Diffusion Transformer (DiT). The method introduces: (1) a dual-reference injection mechanism for both human and product, coupled with a masked cross-attention module to jointly preserve identity features; (2) joint motion guidance integrating 3D body mesh estimation and product bounding-box trajectories to enforce geometric alignment between hand gestures and the product; and (3) structured textual encoding combined with hybrid data augmentation. Experiments demonstrate significant improvements over state-of-the-art methods in identity fidelity, product detail reconstruction (e.g., logos and textures), and motion naturalness. Moreover, the approach supports 3D-consistent modeling under small-angle rotations, enabling robust multi-view synthesis for e-commerce applications.
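The summary mentions a masked cross-attention module that restricts which reference tokens (human vs. product) each video latent token may attend to. The paper's exact formulation is not given here; the following is a minimal numpy sketch of the general masked cross-attention idea, with all names and shapes illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention with a binary attention mask.

    queries: (Nq, d)  latent video tokens
    keys, values: (Nk, d)  reference tokens (e.g. human + product features)
    mask: (Nq, Nk) boolean; False entries are excluded from attention,
          so a query can be confined to one reference's tokens.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (Nq, Nk) similarity scores
    scores = np.where(mask, scores, -1e9)     # block disallowed query-key pairs
    weights = softmax(scores, axis=-1)        # rows sum to 1 over allowed keys
    return weights @ values                   # (Nq, d) attended features
```

Confining product-region queries to product reference tokens (and likewise for the human) is one plausible way such a mask could keep the two identities from blending; the actual masking scheme would depend on the paper's implementation.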

📝 Abstract
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
Problem

Research questions and friction points this paper is trying to address.

Preserve human and product identities in videos
Understand human-product spatial relationships accurately
Generate realistic demonstration motions and interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformers for video generation
Employs masked cross-attention for identity preservation
Utilizes 3D mesh for precise motion guidance
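The motion guidance combines a rendered 3D body mesh with per-frame product bounding boxes. The paper's conditioning format is not specified on this page, but one common way to feed box trajectories to a video diffusion model is to rasterize them into an extra control channel alongside the pose rendering. A minimal sketch (layout and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def box_trajectory_channel(boxes, height, width):
    """Rasterize per-frame product bounding boxes into a binary control channel.

    boxes: list of (x0, y0, x1, y1) pixel coordinates, one tuple per frame.
    Returns a float array of shape (T, height, width) with 1.0 inside each box,
    which could be concatenated to a pose rendering as joint motion guidance.
    """
    chan = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        chan[t, y0:y1, x0:x1] = 1.0   # mark the product region for frame t
    return chan
```

Aligning this channel with the rendered body mesh gives the model an explicit, frame-by-frame signal for where hands and product should meet.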
Lizhen Wang (ByteDance Intelligent Creation)
Zhurong Xia (ByteDance Intelligent Creation)
Tianshu Hu (ByteDance Intelligent Creation)
Pengrui Wang (ByteDance Intelligent Creation)
Pengfei Wang (ByteDance Intelligent Creation)
Zerong Zheng (ByteDance; Computer Vision, Computer Graphics)
Ming Zhou (ByteDance Intelligent Creation)