iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer

2025-06-15
AI Summary
This work addresses natural, photorealistic hand-object interaction (HOI) video reenactment in unconstrained, real-world scenes, where the key difficulties are recovering severely occluded regions, modeling object deformation and physical interaction, and generalizing to unseen objects and scenes. To this end, we propose Inp-TPU, a unified mask-filling token processing paradigm that reuses a pretrained DiT's contextual awareness zero-shot, without introducing additional parameters, which enables strong generalization to novel hands and objects and natively supports long-video synthesis. Our method employs a two-stage video diffusion Transformer: the first stage generates a keyframe containing the target object; the second enforces temporal consistency via spatiotemporal token reweighting and physics-guided constraints. Qualitative and quantitative evaluations on complex real-world scenes show state-of-the-art performance in interaction naturalness, occlusion recovery accuracy, and motion coherence.

Abstract
Digital human video generation is gaining traction in fields such as education and e-commerce, driven by advances in head-body animation and lip-syncing technologies. However, realistic Hand-Object Interaction (HOI), the complex dynamics between human hands and objects, remains challenging. Generating natural and believable HOI reenactments is difficult because of occlusion between hands and objects, variations in object shape and orientation, the need for precise physical interaction, and, importantly, the need to generalize to unseen humans and objects. This paper presents iDiT-HOI, a novel framework that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model. The first stage generates a key frame by inserting the designated object into the hand region, providing a reference for subsequent frames. The second stage ensures temporal coherence and fluidity in hand-object interactions. The key contribution of our method is to reuse the pretrained model's context perception capabilities without introducing additional parameters, which enables strong generalization to unseen objects and scenarios; the proposed paradigm also naturally supports long video generation. Comprehensive evaluations demonstrate that our approach outperforms existing methods, particularly in challenging real-world scenes, offering enhanced realism and more seamless hand-object interactions.
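The two-stage flow described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the frames are small integer grids, and `stage_one_keyframe`, `stage_two_chunk`, and `reenact` are hypothetical stand-ins for the diffusion stages. The chunk-by-chunk loop is only one assumed way that a keyframe reference could support long videos; the abstract does not say how long video generation is actually achieved.

```python
from typing import List, Tuple

# Toy frame: a 2D grid of ints (0 = background, 2 = inserted object).
Frame = List[List[int]]


def stage_one_keyframe(frame: Frame, hand_box: Tuple[int, int, int, int],
                       object_id: int = 2) -> Frame:
    """Hypothetical stand-in for stage one: place the object in the hand region."""
    top, left, bottom, right = hand_box
    out = [row[:] for row in frame]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = object_id
    return out


def stage_two_chunk(keyframe: Frame, anchor: Frame, num_frames: int) -> List[Frame]:
    """Hypothetical stand-in for stage two.

    A real model would generate frames conditioned on the keyframe and on the
    anchor (last frame of the previous chunk). This toy version ignores the
    anchor and simply copies the keyframe.
    """
    return [[row[:] for row in keyframe] for _ in range(num_frames)]


def reenact(first_frame: Frame, hand_box: Tuple[int, int, int, int],
            total_frames: int, chunk_len: int) -> Tuple[Frame, List[Frame]]:
    """Run stage one once, then call stage two chunk by chunk (an assumption)."""
    keyframe = stage_one_keyframe(first_frame, hand_box)
    video: List[Frame] = []
    anchor = keyframe
    while len(video) < total_frames:
        n = min(chunk_len, total_frames - len(video))
        chunk = stage_two_chunk(keyframe, anchor, n)
        video.extend(chunk)
        anchor = chunk[-1]  # carry the last frame forward as the next anchor
    return keyframe, video


blank = [[0] * 4 for _ in range(4)]
keyframe, video = reenact(blank, hand_box=(1, 1, 3, 3), total_frames=10, chunk_len=4)
```

In this sketch the keyframe is computed once and reused for every chunk, mirroring the abstract's statement that the first stage provides a reference for subsequent frames.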
Problem

Research questions and friction points this paper is trying to address.

Generating realistic Hand-Object Interaction (HOI) reenactments
Addressing occlusion and object shape variations in HOI
Enhancing generalization to unseen humans and objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inpainting-based token process method
Two-stage video diffusion transformer model
Reuses pretrained model without extra parameters
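As a rough illustration of what an inpainting-style token layout might look like, the sketch below labels each image patch as context (kept) or fill (to be generated) and appends reference tokens for the object. The layout, the role names, and the helper functions `patch_flags` and `build_token_roles` are all assumptions made for illustration; the source does not specify how Inp-TPU arranges its tokens. The point is only that conditioning can enter as extra tokens in one sequence, which requires no new model parameters.

```python
from typing import List


def patch_flags(mask: List[List[int]], patch: int) -> List[bool]:
    """Return one flag per patch (row-major): True if any pixel in it is masked."""
    rows, cols = len(mask), len(mask[0])
    flags = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            block = [mask[i][j]
                     for i in range(r, min(r + patch, rows))
                     for j in range(c, min(c + patch, cols))]
            flags.append(any(block))
    return flags


def build_token_roles(mask: List[List[int]], patch: int,
                      num_reference_tokens: int) -> List[str]:
    """Assumed layout: frame patches first, then reference-object tokens."""
    roles = ["fill" if flag else "context" for flag in patch_flags(mask, patch)]
    roles += ["reference"] * num_reference_tokens
    return roles


# 4x4 mask with the hand region in the bottom-right corner.
hand_mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
roles = build_token_roles(hand_mask, patch=2, num_reference_tokens=2)
```

Here only the bottom-right patch is marked for filling, so a model attending over the whole sequence would see three context patches and two reference tokens while generating it.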
Authors

Zhelun Shen
Department of Computer Vision Technology (VIS), Baidu Inc., China

Chenming Wu
Researcher, Baidu Inc.
Robotics, Graphics, 3D Vision, Computational Design

Junsheng Zhou
Tsinghua University
3D computer vision

Chen Zhao
Department of Computer Vision Technology (VIS), Baidu Inc., China

Kaisiyuan Wang
Department of Computer Vision Technology (VIS), Baidu Inc., China

Hang Zhou
Department of Computer Vision Technology (VIS), Baidu Inc., China

Yingying Li
UIUC
online control, learning-based control, online learning, safe learning, distributed control

Haocheng Feng
Baidu
computer vision

Wei He
Department of Computer Vision Technology (VIS), Baidu Inc., China

Jingdong Wang
Department of Computer Vision Technology (VIS), Baidu Inc., China