Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model

📅 2025-03-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing human video generation methods handle hand-object interaction poorly, and generalize especially badly when objects vary noticeably in size and shape. To address this, Re-HOLD uses separate layout representations for hands and objects within an adaptive layout-instructed diffusion framework, which disentangles hand modeling from object adaptation across diverse motion sequences. An interactive textural enhancement module, built on two independent memory banks, improves the quality of the generated hands and objects, and a layout-adjusting strategy corrects unreasonable layouts caused by differing object sizes during cross-object reenactment at inference time. Qualitative and quantitative evaluations show that the framework significantly outperforms existing methods.

📝 Abstract

Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To cope with these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we have designed an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout-adjusting strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed framework significantly outperforms existing methods. Project page: https://fyycs.github.io/Re-HOLD.
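The abstract does not specify how the layout-adjusting strategy works, so the following is only a minimal sketch of one way such an adjustment could look, not the authors' method. It assumes the object layout is an axis-aligned bounding box, that a hand contact point is known, and that per-axis size ratios between the target and source objects are available. The names `Box` and `adjust_object_box`, the anchor point, and the scale parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical layout format)."""
    x0: float
    y0: float
    x1: float
    y1: float


def adjust_object_box(box, anchor, scale_w, scale_h, frame_w, frame_h):
    """Rescale an object box about a fixed anchor, then clamp it to the frame.

    Assumption: `anchor` is a hand contact point that should stay fixed so the
    resized object remains in the hand. `scale_w` and `scale_h` are the
    target-to-source object size ratios along each axis.
    """
    ax, ay = anchor
    # Scale each corner's offset from the anchor point.
    x0 = ax + (box.x0 - ax) * scale_w
    x1 = ax + (box.x1 - ax) * scale_w
    y0 = ay + (box.y0 - ay) * scale_h
    y1 = ay + (box.y1 - ay) * scale_h
    # Clamp to the frame so the adjusted layout stays valid.
    return Box(
        max(0.0, x0),
        max(0.0, y0),
        min(float(frame_w), x1),
        min(float(frame_h), y1),
    )


# Driving layout holds a wide, short object; the target is narrower and taller.
driving_box = Box(40.0, 40.0, 80.0, 100.0)
grasp_point = (60.0, 100.0)  # assumed contact point at the bottom centre
adjusted = adjust_object_box(driving_box, grasp_point, 0.5, 2.0, 128, 128)
```

Anchoring the rescale at the contact point is a design choice made for this illustration: it keeps the object attached to the hand when its size changes, whereas scaling about the box centre could pull the object away from the grasp.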

Problem

Research questions and friction points this paper is trying to address.

Generating realistic hand-object interaction videos with diverse object sizes and shapes

Disentangling hand modeling from object adaptation across diverse motion sequences

Improving generation quality via interactive textural enhancement and layout adjustment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Layout-instructed Diffusion model for HOI

Interactive textural enhancement with memory banks

Layout-adjusting strategy for cross-object reenactment
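To make the idea of two independent memory banks concrete, here is a minimal sketch under stated assumptions. The source only says that the textural enhancement module introduces two independent memory banks; the fixed-capacity first-in-first-out storage, the dot-product softmax readout, and the `FeatureBank` class with its `write` and `read` methods are all hypothetical choices made for illustration, not the paper's design.

```python
import math
from collections import deque


class FeatureBank:
    """Fixed-capacity feature store with a softmax-attention readout (assumed)."""

    def __init__(self, capacity):
        # The oldest entries are dropped automatically once capacity is reached.
        self.items = deque(maxlen=capacity)

    def write(self, feature):
        self.items.append(list(feature))

    def read(self, query):
        """Return the attention-weighted average of stored features."""
        if not self.items:
            # Nothing stored yet: fall back to the query itself.
            return list(query)
        scores = [sum(q * k for q, k in zip(query, item)) for item in self.items]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * item[d] for w, item in zip(weights, self.items)) / total
            for d in range(len(query))
        ]


# Two separate banks, so hand textures never mix with object textures.
hand_bank = FeatureBank(capacity=2)
object_bank = FeatureBank(capacity=2)

for feature in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    hand_bank.write(feature)

hand_readout = hand_bank.read([0.0, 5.0])
object_readout = object_bank.read([0.3, 0.7])
```

Keeping the banks as separate instances mirrors the independence described in the source: writes to the hand bank leave the object bank untouched.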

🔎 Similar Papers

Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

2024-05-07 | arXiv.org | Citations: 15