🤖 AI Summary
Existing methods for hand–object interaction video generation often suffer from inconsistent object identity and physically implausible contact in complex in-the-wild scenes, exhibiting limited generalization. This work proposes a lightweight enhancement strategy that builds upon pretrained video generation models by introducing Head-Sliding RoPE to enable temporally balanced injection of object representations. Furthermore, a two-level spatial attention gating mechanism is designed to precisely localize interaction regions and adaptively modulate their influence strength. The proposed approach significantly improves the temporal consistency and spatial realism of generated videos, outperforming state-of-the-art HOI reenactment and general video editing methods on unseen in-the-wild scenarios while preserving object identity coherence and background fidelity.
📝 Abstract
Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
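The abstract does not give implementation details, but the core idea of Head-Sliding RoPE — assigning each attention head a different temporal offset for reference tokens so no single frame dominates their influence — can be sketched in NumPy. Everything below is a hypothetical reading of that one sentence: the function names, the linear offset schedule, and the tensor shapes are all assumptions, not the paper's actual method.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard 1D RoPE: one rotation frequency per pair of channels.
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (len(positions), dim/2)

def head_sliding_offsets(num_heads, num_frames):
    # Hypothetical schedule: slide each head's reference-token temporal
    # position evenly across the clip, so the reference is "temporally
    # close" to early frames in some heads and late frames in others,
    # counteracting RoPE's long-range attention decay.
    return np.linspace(0, num_frames - 1, num_heads)

H, T, dim = 8, 16, 64                               # heads, frames, head dim
offsets = head_sliding_offsets(H, T)                # (H,)
# Per-head temporal RoPE angles for a single reference token: instead of
# pinning the reference at one position (e.g., t=0) for every head, each
# head rotates it by its own offset.
ref_angles = np.stack([rope_angles(np.array([o]), dim)[0] for o in offsets])
print(ref_angles.shape)  # (8, 32)
```

Under this reading, a reference token attended through head `h` behaves as if it sat at frame `offsets[h]`, so averaged over heads its influence is roughly uniform across time rather than concentrated near one end of the clip.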