🤖 AI Summary
Existing hand-object 3D reconstruction methods rely heavily on keypoint detection and generalize poorly under weak texture, severe occlusion, and diverse object geometries. This paper proposes the first keypoint-free, end-to-end framework for estimating hand-object 3D transformations, eliminating dependencies on structure-from-motion (SfM), hand-keypoint optimization, and camera intrinsics. The method combines pose estimation from monocular motion video with multi-view reconstruction, integrating differentiable rendering and self-supervised optimization to enable markerless, template-free, uncalibrated reconstruction in generic scenes. On the SHOWMe benchmark, the approach achieves state-of-the-art performance in joint 3D pose and shape estimation for hands and objects, and experiments on HO3D demonstrate strong generalization to unseen object categories.
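To make the "differentiable rendering with self-supervised optimization" idea concrete, here is a minimal, self-contained sketch of that kind of loop: a rigid transform is optimized by differentiably rendering a soft silhouette of an object point cloud and comparing it to a target mask, with no keypoints or intrinsics involved. Everything here (orthographic projection, Gaussian point splatting, the toy point cloud) is an illustrative assumption, not HOSt3R's implementation.

```python
import torch

def rodrigues(rotvec: torch.Tensor) -> torch.Tensor:
    """Axis-angle vector (3,) -> rotation matrix (3, 3) via Rodrigues' formula."""
    theta = rotvec.norm().clamp_min(1e-8)
    k = rotvec / theta
    zero = torch.zeros(())
    K = torch.stack([                       # cross-product matrix [k]_x
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def soft_silhouette(pts2d: torch.Tensor, hw: int = 64, sigma: float = 5e-3) -> torch.Tensor:
    """Differentiable point-splat rendering: each projected point contributes
    a Gaussian blob; pixel occupancy is a smooth union over all blobs."""
    axis = torch.linspace(-1.0, 1.0, hw)
    grid = torch.stack(torch.meshgrid(axis, axis, indexing="ij"), dim=-1)  # (hw, hw, 2)
    d2 = ((grid[:, :, None, :] - pts2d[None, None, :, :]) ** 2).sum(-1)    # (hw, hw, N)
    return 1.0 - torch.exp(-torch.exp(-d2 / sigma).sum(-1))                # in [0, 1)

torch.manual_seed(0)
points = torch.rand(256, 3) - 0.5           # toy object point cloud

def render(rotvec: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    p = points @ rodrigues(rotvec).T + trans  # apply rigid transform
    return soft_silhouette(p[:, :2])          # orthographic: drop depth, no intrinsics

# Target mask rendered from a "ground-truth" pose (stands in for a segmentation mask).
target = render(torch.tensor([0.4, -0.2, 0.1]), torch.tensor([0.15, -0.1, 0.0])).detach()

rotvec = torch.full((3,), 0.01, requires_grad=True)  # small nonzero init keeps the
trans = torch.zeros(3, requires_grad=True)           # axis-angle gradient well-defined
opt = torch.optim.Adam([rotvec, trans], lr=5e-2)
for step in range(200):                              # self-supervised refinement loop
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render(rotvec, trans), target)
    loss.backward()
    opt.step()
print(f"silhouette loss after optimization: {loss.item():.6f}")
```

The same pattern scales up by swapping the splat renderer for a mesh rasterizer and the mask loss for photometric or mask-consistency terms across frames.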
📝 Abstract
Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach to object-agnostic hand-object reconstruction from RGB sequences is a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting both scalability and generalization. To enable generic, seamless, and non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach for estimating hand-object 3D transformations from monocular motion video or images, and we integrate it with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained: it does not rely on pre-scanned object templates or camera intrinsics, and it achieves state-of-the-art performance on object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
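For readers who want the two-stage structure at a glance, the following is a schematic, runnable skeleton of the pipeline shape the abstract describes: per-frame transform tracking, then fusion of posed views into one shape. The function names and the identity-transform stub are hypothetical placeholders, not HOSt3R's API.

```python
import numpy as np

def predict_relative_transform(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Placeholder for a learned, keypoint-free pose regressor: returns a 4x4
    rigid transform mapping frame_b's coordinates to frame_a's. Stubbed here."""
    return np.eye(4)

def track(frames: list[np.ndarray]) -> list[np.ndarray]:
    """Stage 1: chain pairwise predictions into per-frame absolute poses
    expressed in the first frame's coordinate system."""
    poses = [np.eye(4)]
    for a, b in zip(frames, frames[1:]):
        poses.append(poses[-1] @ predict_relative_transform(a, b))
    return poses

def fuse(partial_clouds: list[np.ndarray], poses: list[np.ndarray]) -> np.ndarray:
    """Stage 2: bring per-view partial point clouds into the common frame
    and merge them into a single hand-object reconstruction."""
    merged = []
    for cloud, pose in zip(partial_clouds, poses):
        homo = np.concatenate([cloud, np.ones((len(cloud), 1))], axis=1)  # (N, 4)
        merged.append((homo @ pose.T)[:, :3])
    return np.concatenate(merged, axis=0)

# Toy usage: five random "frames" with per-view partial clouds.
frames = [np.random.rand(64, 64, 3) for _ in range(5)]
clouds = [np.random.rand(100, 3) for _ in range(5)]
shape = fuse(clouds, track(frames))
print(shape.shape)  # (500, 3)
```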