iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer

📅 2025-06-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses natural, photorealistic hand-object interaction (HOI) video reenactment in unconstrained real-world scenes, where the main difficulties are occlusion between hands and objects, variation in object shape and orientation, the need for precise physical interaction, and generalization to unseen humans and objects. The authors propose iDiT-HOI, a framework built on Inp-TPU, a unified inpainting-based token process method that reuses the pretrained DiT's context perception without introducing additional parameters. This supports generalization to unseen objects and scenarios and extends naturally to long-video generation. The method uses a two-stage video diffusion transformer: the first stage generates a key frame by inserting the designated object into the hand region, and the second stage keeps the hand-object interaction temporally coherent and fluid across subsequent frames. Evaluations show the approach outperforming existing methods, particularly in challenging real-world scenes, with more realistic and seamless hand-object interactions.

📝 Abstract

Digital human video generation is gaining traction in fields like education and e-commerce, driven by advancements in head-body animation and lip-syncing technologies. However, realistic Hand-Object Interaction (HOI), the complex dynamics between human hands and objects, continues to pose challenges. Generating natural and believable HOI reenactments is difficult because of occlusion between hands and objects, variations in object shapes and orientations, and the need for precise physical interactions. Just as importantly, a method must generalize to unseen humans and objects. This paper presents a novel framework, iDiT-HOI, that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model. The first stage generates a key frame by inserting the designated object into the hand region, providing a reference for subsequent frames. The second stage ensures temporal coherence and fluidity in hand-object interactions. The key contribution of our method is to reuse the pretrained model's context perception capabilities without introducing additional parameters, enabling strong generalization to unseen objects and scenarios. The proposed paradigm also naturally supports long video generation. Comprehensive evaluations demonstrate that our approach outperforms existing methods, particularly in challenging real-world scenes, offering enhanced realism and more seamless hand-object interactions.

Problem

Research questions and friction points this paper is trying to address.

Generating realistic Hand-Object Interaction (HOI) reenactments

Addressing occlusion and object shape variations in HOI

Enhancing generalization to unseen humans and objects

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inpainting-based token process method

Two-stage video diffusion transformer model

Reuses pretrained model without extra parameters
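
The summary does not say how Inp-TPU lays out its tokens, so the sketch below is only a generic illustration of inpainting-style conditioning, not the authors' implementation. It assumes the hand-object region is masked out of the video, the masked frames and a reference object image are patchified into a single token sequence (so no new parameters are needed to inject the reference), and the generated pixels are composited back only inside the mask. The function names, the patch size, and the token ordering are all hypothetical.

```python
import numpy as np


def patchify(frames, patch):
    """Split (T, H, W, C) frames into tokens of shape (T*H/p*W/p, p*p*C)."""
    t, h, w, c = frames.shape
    x = frames.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the pixels of each patch together
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)


def build_inpainting_tokens(video, mask, reference, patch=4):
    """Hypothetical helper: masked video tokens followed by reference tokens.

    video:     (T, H, W, C) source frames
    mask:      (T, H, W), 1 where content must be generated, 0 where it is kept
    reference: (H, W, C) image of the object to insert
    """
    masked = video * (1.0 - mask[..., None])       # hide the region to fill
    video_tokens = patchify(masked, patch)
    ref_tokens = patchify(reference[None], patch)  # reference as one extra frame
    return np.concatenate([video_tokens, ref_tokens], axis=0)


def composite(original, generated, mask):
    """Keep original pixels outside the mask, generated pixels inside it."""
    m = mask[..., None]
    return m * generated + (1.0 - m) * original
```

In a layout like this, a pretrained transformer would see the reference object as ordinary context tokens, which is one plausible reading of "reusing context perception without additional parameters"; the actual design in the paper may differ.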
