🤖 AI Summary
To address the lack of force perception modeling and cross-view transfer research in articulated object manipulation, this paper introduces HOI-Force, the first force-grounded, multi-view synchronized multimodal manipulation dataset. HOI-Force comprises 3,048 manipulation sequences across 381 objects and 38 environments, with high-precision temporal synchronization of visual, six-degree-of-freedom (6-DoF) force, and tactile signals across four embodiments: the human hand, the human hand with a wrist-mounted camera, a handheld UMI gripper, and a custom Hoi! gripper. Its core contributions are threefold: (1) real-world force annotations spanning diverse embodiments; (2) support for force prediction, cross-view imitation learning, and joint visuo-tactile-force representation learning; and (3) the largest and most modally comprehensive benchmark to date for articulated object manipulation. The dataset and code are publicly released.
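To make the force-prediction task concrete, below is a minimal sketch of what a baseline could look like: regressing a 6-DoF end-effector wrench from a single RGB frame. This is not the paper's method; the architecture, names, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ForceFromVideo(nn.Module):
    """Toy wrench-regression baseline. Hypothetical: the architecture,
    shapes, and names are assumptions, not the paper's model."""

    def __init__(self) -> None:
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 6)  # (fx, fy, fz, tx, ty, tz)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W) -> predicted 6-DoF wrench (B, 6)
        return self.head(self.backbone(rgb))

# Smoke test with a dummy batch of frames
model = ForceFromVideo()
wrench = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 6)
```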
📝 Abstract
We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3,048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments: (i) a human hand, (ii) a human hand with a wrist-mounted camera, (iii) a handheld UMI gripper, and (iv) a custom Hoi! gripper, where the tool embodiments provide synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers not only to evaluate how well methods transfer between human and robotic viewpoints, but also to investigate underexplored modalities such as force sensing and prediction.
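The synchronized multimodal layout can be pictured as one record per timestep, as in the hypothetical sketch below. All field names and shapes are assumptions; the abstract does not specify the released data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HOIForceFrame:
    """One time-synchronized sample from a manipulation sequence.

    Field names and shapes are hypothetical placeholders, not the
    dataset's actual schema.
    """
    rgb: np.ndarray        # (H, W, 3) camera frame for this embodiment
    wrench: np.ndarray     # (6,) end-effector force/torque (tool embodiments)
    tactile: np.ndarray    # tactile reading; shape depends on the sensor
    embodiment: str        # "hand" | "hand_wrist_cam" | "umi" | "hoi"
    timestamp: float       # seconds on a clock shared across modalities

def nearest_index(timestamps: np.ndarray, query_t: float) -> int:
    """Nearest-neighbor alignment of one modality's timeline to another.

    Purely illustrative: the dataset is described as already
    synchronized, so this is only needed when resampling externally.
    """
    return int(np.argmin(np.abs(timestamps - query_t)))
```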