🤖 AI Summary
Existing 3D hand reconstruction methods struggle to balance accuracy and deployment flexibility: single-view approaches suffer from depth ambiguity and occlusion, while multi-view techniques rely heavily on camera calibration, which limits their practicality. This work proposes an end-to-end feed-forward network that, for the first time, jointly infers a 3D hand mesh and camera poses directly from arbitrary uncalibrated views. By formulating reconstruction as a visual-geometry grounded task and leveraging large-scale, unstructured in-the-wild image data from the internet, the model achieves improved generalization and robustness in real-world scenarios. Experiments show that the method outperforms state-of-the-art approaches across multiple benchmarks, with particularly strong performance in uncalibrated, in-the-wild settings.
📝 Abstract
Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation, and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility: they must be able to leverage massive amounts of unstructured image data from the internet, or to run on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. Single-view approaches are easy to deploy but suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these ambiguities but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods on benchmarks and generalizes well to uncalibrated, in-the-wild scenarios. Project page: https://lym29.github.io/HGGT/.