GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for hand-object interaction (HOI) video generation often produce inconsistent object identity and physically implausible contact in complex in-the-wild scenes, and they generalize poorly beyond their training domain. This work proposes GenHOI, a lightweight augmentation to pretrained video generation models. It introduces Head-Sliding RoPE, which injects the reference-object representation in a temporally balanced way, and a two-level spatial attention gate that concentrates object-conditioned attention on interaction regions and adaptively scales its strength. On unseen in-the-wild scenes, the approach outperforms state-of-the-art HOI reenactment and all-in-one video editing methods, improving object consistency across frames while preserving background realism.
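To make the gating idea in the summary concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the sigmoid gates, the additive injection, and the names `gated_object_injection`, `region_logits`, and `strength_logit` are all assumptions chosen for illustration. The source only states that the gate localizes HOI regions and adaptively scales the strength of the object-conditioned signal.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here for both gate levels (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))


def gated_object_injection(base_out, ref_out, region_logits, strength_logit):
    """Hypothetical two-level gate for injecting object-conditioned features.

    base_out:       (n_tokens, d) output of the pretrained attention layer
    ref_out:        (n_tokens, d) output of attention over reference-object tokens
    region_logits:  (n_tokens,)   per-token score; high means "inside an HOI region"
    strength_logit: scalar        global score controlling injection strength
    """
    region_gate = sigmoid(region_logits)   # level 1: where to inject
    strength = sigmoid(strength_logit)     # level 2: how strongly to inject
    # Tokens with a closed region gate keep the pretrained output unchanged,
    # which is one way to leave the background untouched.
    return base_out + strength * region_gate[:, None] * ref_out
```

With a closed region gate the output reduces to `base_out`, so background tokens are left as the pretrained model produced them; with an open gate the reference signal is added at the chosen strength.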

📝 Abstract
Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
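The head-specific temporal offsets described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the even spacing of offsets across the clip, the restriction to the temporal axis (the source refers to 3D RoPE), and the helper names `head_sliding_positions`, `rope_rotate`, and `rotate_reference_keys` are all hypothetical. The rotary embedding itself is the standard 1D formulation.

```python
import numpy as np


def head_sliding_positions(num_heads: int, num_frames: int) -> np.ndarray:
    """Hypothetical offset rule: one temporal position per head, evenly spaced."""
    if num_heads < 1 or num_frames < 1:
        raise ValueError("num_heads and num_frames must be positive")
    if num_heads == 1:
        return np.array([(num_frames - 1) / 2.0])
    return np.linspace(0.0, num_frames - 1, num_heads)


def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Standard 1D rotary embedding. x: (n, d) with even d, positions: (n,)."""
    d = x.shape[-1]
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]     # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


def rotate_reference_keys(ref_keys: np.ndarray, num_frames: int) -> np.ndarray:
    """Rotate reference-token keys with a different temporal position per head.

    ref_keys: (num_heads, n_ref, d). Every reference token in head h is placed
    at that head's temporal offset instead of a single shared position.
    """
    num_heads, n_ref, _ = ref_keys.shape
    offsets = head_sliding_positions(num_heads, num_frames)
    return np.stack(
        [rope_rotate(ref_keys[h], np.full(n_ref, offsets[h])) for h in range(num_heads)]
    )
```

Under this spacing rule, 4 heads over a 10-frame clip place the reference tokens at temporal positions 0, 3, 6, and 9, so every frame lies within one step of some head's reference position, rather than up to nine steps away from a single shared position.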
Problem

Research questions and friction points this paper is trying to address.

Hand-Object Interaction
Object Consistency
Video Synthesis
Temporal Coherence
In-the-wild Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hand-Object Interaction
Temporal Consistency
Spatial Attention
Video Generation
Object Injection
Authors

Xuan Huang
Department of Computer Vision Technology (VIS), Baidu Inc., China; Shenzhen Campus of Sun Yat-Sen University, China

Mochu Xiang
Northwestern Polytechnical University
Monocular Depth Estimation

Zhelun Shen
Department of Computer Vision Technology (VIS), Baidu Inc., China

Jinbo Wu
Baidu Inc.
3D Vision, Computer Graphics

Chenming Wu
Researcher, Baidu Inc.
Robotics, Graphics, 3D Vision, Computational Design

Chen Zhao
Department of Computer Vision Technology (VIS), Baidu Inc., China

Kaisiyuan Wang
Department of Computer Vision Technology (VIS), Baidu Inc., China

Hang Zhou
Baidu Inc.
Computer Vision, Audio Processing, Multimodal Learning

Shanshan Liu
Department of Computer Vision Technology (VIS), Baidu Inc., China

Haocheng Feng
Baidu
Computer Vision

Wei He
Baidu
Natural Language Processing

Jingdong Wang
Department of Computer Vision Technology (VIS), Baidu Inc., China