Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

📅 2025-12-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing hand-object interaction video generation methods struggle to simultaneously achieve scalability and interaction fidelity due to inherent limitations of 2D or 3D representations. To address this, we propose a novel structure- and contact-aware representation that requires no 3D annotations, explicitly modeling contact states, occlusion relationships, and geometric structural constraints. Our approach adopts a shared-specialized joint generation paradigm, integrating spatiotemporal consistency modeling under 2D supervision to precisely capture complex physical interactions and enable generalization to open-world scenes. Evaluated on two real-world datasets, our method significantly outperforms state-of-the-art approaches, generating physically plausible, temporally coherent, high-fidelity interaction videos. It achieves substantial improvements across three key dimensions: interaction fidelity (e.g., accurate contact localization and force alignment), dynamic coherence (e.g., smooth motion transitions and consistent object dynamics), and cross-scene generalization (e.g., robust performance under unseen object geometries and hand poses).

Technology Category

Application Category

📝 Abstract
Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic hand-object interaction videos
Addressing scalability and fidelity trade-off in HOI representation
Enabling generalization to open-world scenarios without 3D annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure and contact-aware representation without 3D annotations
Joint-generation paradigm with share-and-specialization strategy
Scalable supervision for open-world generalization
🔎 Similar Papers
No similar papers found.
Haodong Yan
Haodong Yan
PhD student of INTR, HKUST (GZ)
Human reconstructionmotion prediction
H
Hang Yu
The Hong Kong University of Science and Technology (Guangzhou)
Zhide Zhong
Zhide Zhong
Beijing Institute of Technology
Robotics
W
Weilin Yuan
The Hong Kong University of Science and Technology (Guangzhou)
X
Xin Gong
The Hong Kong University of Science and Technology (Guangzhou)
Z
Zehang Luo
The Hong Kong University of Science and Technology (Guangzhou)
C
Chengxi Heyu
The Hong Kong University of Science and Technology (Guangzhou)
J
Junfeng Li
The Hong Kong University of Science and Technology (Guangzhou)
Wenxuan Song
Wenxuan Song
The Hong Kong University of Science and Technology (Guangzhou)
Vision-language-action ModelRobotics
Shunbo Zhou
Shunbo Zhou
Huawei | The Chinese University of Hong Kong
RoboticsEmbodied AIAutonomous Navigation
Haoang Li
Haoang Li
Assistant Professor, Hong Kong University of Science and Technology (Guangzhou)
Robotics3D Computer Vision