HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images

📅 2025-08-22
🤖 AI Summary
Existing hand-object 3D reconstruction methods rely heavily on keypoint detection and generalize poorly under weak texture, severe occlusion, and diverse object geometries. This paper proposes HOSt3R, a keypoint detector-free approach for estimating hand-object 3D transformations from monocular RGB video. It removes the dependency on structure-from-motion (SfM), hand-keypoint optimization, pre-scanned object templates, and camera intrinsics. The estimated transformations feed a multi-view reconstruction pipeline that recovers hand-object 3D shape, enabling template-free and uncalibrated reconstruction. On the SHOWMe benchmark, the approach achieves state-of-the-art performance for object-agnostic hand-object 3D transformation and shape estimation. Experiments on sequences from HO3D further demonstrate generalization to unseen object categories.

📝 Abstract
Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
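
The two-stage idea described in the abstract (estimate a per-frame hand-object 3D transformation, then combine the views) can be illustrated with a small geometric sketch. This is not the HOSt3R implementation: the pose convention `x_cam = R * x_obj + t`, the function names, and the synthetic poses are all assumptions made for illustration. In the actual method the transformations would be predicted from images rather than given.

```python
import math

def rot_z(angle_rad):
    """Rotation matrix about the z-axis (used only to build synthetic poses)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(R, t, p):
    """Map a point from the object (canonical) frame to the camera frame."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def invert_pose(R, t, p_cam):
    """Map a camera-frame point back to the object frame: R^T (p_cam - t)."""
    d = [p_cam[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]

def fuse_frames(frames):
    """Merge per-frame camera-space points into one object-frame point set.

    frames: list of (R, t, points_cam) tuples, one per video frame.
    """
    merged = []
    for R, t, points_cam in frames:
        merged.extend(invert_pose(R, t, p) for p in points_cam)
    return merged

# Synthetic example: three object points observed under two hypothetical poses.
canonical = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.3]]
poses = [
    (rot_z(0.0), [0.0, 0.0, 0.5]),
    (rot_z(math.pi / 2), [0.05, 0.0, 0.6]),
]
frames = [(R, t, [apply_pose(R, t, p) for p in canonical]) for R, t in poses]
fused = fuse_frames(frames)
```

With accurate transformations, observations from every frame land on the same object-frame coordinates, which is why the quality of the estimated hand-object transformations directly bounds the quality of the reconstructed shape.
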

Problem

Research questions and friction points this paper is trying to address.

Reconstructing 3D hand-object interactions without keypoint detection
Overcoming the failure of keypoint-based methods under diverse object geometries, weak textures, and mutual hand-object occlusions
Enabling template-free reconstruction from monocular RGB videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Keypoint-free hand-object 3D transformation estimation
Integration of monocular-video transformation estimation with a multi-view reconstruction pipeline
Template-free approach that requires no camera intrinsics
Authors

Anilkumar Swamy
NAVER LABS Europe, Inria centre at the University Grenoble Alpes
Vincent Leroy
NAVER LABS Europe
Philippe Weinzaepfel
Principal Research Scientist, NAVER LABS Europe
Computer Vision, Deep Learning
Jean-Sébastien Franco
Inria centre at the University Grenoble Alpes
Grégory Rogez
Research Scientist, NAVER LABS Europe
Computer Vision, Pattern Recognition, Machine Learning