DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in e-commerce human-product demonstration video generation—identity loss, spatial-relationship distortion, and unnatural motion—this paper proposes a novel diffusion-based framework built upon the Diffusion Transformer (DiT). The method introduces: (1) a dual-reference injection mechanism for both human and product, coupled with a masked cross-attention module to jointly preserve identity features; (2) joint motion guidance integrating 3D body mesh estimation and product bounding-box trajectories to enforce geometric alignment between hand gestures and the product; and (3) structured textual encoding combined with hybrid data augmentation. Experiments demonstrate significant improvements over state-of-the-art methods in identity fidelity, product detail reconstruction (e.g., logos and textures), and motion naturalness. Moreover, the approach supports 3D-consistent modeling under small-angle rotations, enabling robust multi-view synthesis for e-commerce applications.
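The summary mentions a masked cross-attention module that restricts which reference tokens (human vs. product) each video latent token may attend to. The paper's exact formulation is not given here; the following is a minimal numpy sketch of the general masked cross-attention idea, with all names and shapes illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention with a binary attention mask.

    queries: (Nq, d)  latent video tokens
    keys, values: (Nk, d)  reference tokens (e.g. human + product features)
    mask: (Nq, Nk) boolean; False entries are excluded from attention,
          so a query can be confined to one reference's tokens.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (Nq, Nk) similarity scores
    scores = np.where(mask, scores, -1e9)     # block disallowed query-key pairs
    weights = softmax(scores, axis=-1)        # rows sum to 1 over allowed keys
    return weights @ values                   # (Nq, d) attended features
```

Confining product-region queries to product reference tokens (and likewise for the human) is one plausible way such a mask could keep the two identities from blending; the actual masking scheme would depend on the paper's implementation.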

📝 Abstract
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
Problem

Research questions and friction points this paper is trying to address.

Preserve human and product identities in videos
Understand human-product spatial relationships accurately
Generate realistic demonstration motions and interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformers for video generation
Employs masked cross-attention for identity preservation
Utilizes 3D mesh for precise motion guidance
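The motion guidance combines a rendered 3D body mesh with per-frame product bounding boxes. The paper's conditioning format is not specified on this page, but one common way to feed box trajectories to a video diffusion model is to rasterize them into an extra control channel alongside the pose rendering. A minimal sketch (layout and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def box_trajectory_channel(boxes, height, width):
    """Rasterize per-frame product bounding boxes into a binary control channel.

    boxes: list of (x0, y0, x1, y1) pixel coordinates, one tuple per frame.
    Returns a float array of shape (T, height, width) with 1.0 inside each box,
    which could be concatenated to a pose rendering as joint motion guidance.
    """
    chan = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        chan[t, y0:y1, x0:x1] = 1.0   # mark the product region for frame t
    return chan
```

Aligning this channel with the rendered body mesh gives the model an explicit, frame-by-frame signal for where hands and product should meet.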
Lizhen Wang (ByteDance Intelligent Creation)
Zhurong Xia (ByteDance Intelligent Creation)
Tianshu Hu (ByteDance Intelligent Creation)
Pengrui Wang (ByteDance Intelligent Creation)
Pengfei Wang (ByteDance Intelligent Creation)
Zerong Zheng (ByteDance; Computer Vision, Computer Graphics)
Ming Zhou (ByteDance Intelligent Creation)