HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images

📅 2026-03-25

🤖 AI Summary

Existing 3D hand reconstruction methods struggle to balance accuracy and deployment flexibility: single-view approaches suffer from depth ambiguity and occlusion, while multi-view techniques typically rely on fixed, calibrated camera setups, which limits their practicality. This work proposes an end-to-end feed-forward network that, for the first time, jointly infers 3D hand meshes and camera poses directly from uncalibrated images taken from arbitrary views. It formulates reconstruction as a visual-geometry grounded task, drawing on 3D foundation models that learn explicit geometry directly from visual data. Experiments show that the method outperforms state-of-the-art approaches and generalizes well to uncalibrated, in-the-wild settings.

📝 Abstract

Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility: the ability to leverage massive amounts of unstructured image data from the internet, or to deploy on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Project page: https://lym29.github.io/HGGT/
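The single-view depth ambiguity mentioned in the abstract can be illustrated with a small pinhole-camera sketch. This is not the paper's method. The `project` helper, the focal length, and the camera offset are all hypothetical values chosen for illustration. The second camera's position is assumed known here, whereas HGGT is described as inferring camera poses itself.

```python
# Illustrative only: an idealized pinhole camera with assumed parameters.

def project(point, cam_x=0.0, focal=500.0):
    """Project a 3D point (x, y, z) into a camera placed at (cam_x, 0, 0)
    that looks down the +z axis. Returns pixel offsets from the image centre."""
    x, y, z = point
    return (focal * (x - cam_x) / z, focal * y / z)

# Two candidate 3D positions on the same viewing ray of camera A:
# the second is the first scaled by 2 (twice as far away).
near = (0.1, 0.05, 0.5)
far = tuple(2.0 * c for c in near)

# Camera A sits at the origin, so both points land on the same pixel.
view_a_near = project(near)
view_a_far = project(far)

# Camera B is shifted 0.2 units along x, so the two points separate.
view_b_near = project(near, cam_x=0.2)
view_b_far = project(far, cam_x=0.2)

print("camera A:", view_a_near, view_a_far)
print("camera B:", view_b_near, view_b_far)
```

In camera A the two candidates are indistinguishable, so a single view cannot tell them apart. The second view separates them, but only once the relative camera pose is known, which is the dependency on calibration that the abstract sets out to remove.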

Problem

Research questions and friction points this paper is trying to address.

3D hand reconstruction

uncalibrated images

hand mesh

depth ambiguity

multi-view

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D hand reconstruction

uncalibrated images

camera pose estimation