HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images

📅 2026-03-25
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing 3D hand reconstruction methods struggle to balance accuracy and deployment flexibility: single-view approaches suffer from depth ambiguity and occlusion, while multi-view techniques typically require fixed, calibrated camera setups, which limits their practicality. This work proposes a feed-forward network that jointly infers 3D hand meshes and camera poses directly from arbitrary uncalibrated views, which the authors describe as a first in the literature. The approach reformulates hand reconstruction as a visual-geometry grounded task, drawing on 3D foundation models that learn explicit geometry from visual data. This matters for settings such as unstructured internet images or consumer-grade RGB cameras, where calibration is unavailable. The authors report that the method outperforms state-of-the-art approaches in their evaluations and generalizes well to uncalibrated, in-the-wild scenarios.

📝 Abstract
Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation, and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility: they must be able to leverage massive amounts of unstructured image data from the internet, or to run on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Project page: https://lym29.github.io/HGGT/.
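The depth ambiguity mentioned in the abstract can be illustrated with a toy pinhole-camera example. This is only a sketch of the general geometric fact, not the paper's method: the functions `project` and `triangulate_depth`, the unit focal length, and the 0.2 baseline are all hypothetical choices made for illustration. Two points on the same viewing ray project to the same pixel in one view, and a second view separates them only if the relative camera placement (here, the baseline) is known. That known placement is the calibration requirement of classical multi-view setups.

```python
import math

# Toy pinhole model; all names and values are illustrative assumptions,
# not taken from the paper.

def project(point, focal=1.0, cam_x=0.0):
    """Project (x, y, z) into a camera at (cam_x, 0, 0) looking down +z."""
    x, y, z = point
    return (focal * (x - cam_x) / z, focal * y / z)

def triangulate_depth(u_left, u_right, baseline, focal=1.0):
    """Depth from horizontal disparity between two parallel cameras.

    Requires the baseline, i.e. the relative camera pose, to be known.
    """
    disparity = u_left - u_right
    return focal * baseline / disparity

near = (0.1, 0.05, 0.5)
far = (0.2, 0.10, 1.0)   # same ray through the first camera, twice as far

# Single view: both points land on the same image coordinates.
single_near = project(near)
single_far = project(far)
same_pixel = all(math.isclose(a, b) for a, b in zip(single_near, single_far))

# Second view, shifted by a known baseline along x: the projections differ.
baseline = 0.2
u_near_right = project(near, cam_x=baseline)[0]
u_far_right = project(far, cam_x=baseline)[0]

depth_near = triangulate_depth(single_near[0], u_near_right, baseline)
depth_far = triangulate_depth(single_far[0], u_far_right, baseline)

print("same pixel in single view:", same_pixel)
print("recovered depths:", depth_near, depth_far)
```

If the baseline were unknown, the disparity alone would fix depth only up to scale, which is why a method for uncalibrated views has to estimate camera poses together with the geometry.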
Problem

Research questions and friction points this paper is trying to address.

3D hand reconstruction
uncalibrated images
hand mesh
depth ambiguity
multi-view
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D hand reconstruction
uncalibrated images
camera pose estimation
visual-geometry grounding
feed-forward architecture