AI Summary
Monocular 3D hand reconstruction suffers from severe geometric ambiguity under self-occlusion and hand-object interaction, where RGB appearance alone is insufficient to recover fine structural details. This work proposes GeoHand, a framework that, for the first time, integrates the frozen general-purpose monocular geometry estimator MoGe2 into hand reconstruction. GeoHand uses a GeoAdapter module to recalibrate MoGe2's spatial features for hands, then combines gated cross-modal token fusion with a Keypoint-Queried Iterative Refiner (KQIR), pairing global geometric disambiguation with local joint-level correction while preserving RGB-based appearance details. The method achieves state-of-the-art performance on the FreiHAND, DexYCB, and HO3Dv3 benchmarks, with gains most pronounced under severe occlusion and hand-object interaction.
Abstract
Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are often noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter that recalibrates the spatial features, adapting them specifically for detailed hand reconstruction. Furthermore, to integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.
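
The gated fusion and keypoint query steps described above can be illustrated with a small sketch. The abstract does not specify the form of the gate, the projection model, or how features are sampled, so everything here is an assumption: a per-channel sigmoid gate applied residually to the RGB token, a pinhole projection, and bilinear sampling of a feature map at the projected joint. The function names (`gated_fuse`, `project_joint`, `bilinear_sample`) and all parameters are hypothetical and are not taken from the GeoHand implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(rgb_token, geo_token, gate_weights, gate_bias):
    """Fuse one RGB token with one geometry token (assumed gate form).

    gate_weights[c] holds 2*C weights over concat(rgb_token, geo_token).
    The gate scales the geometry token before it is added to the RGB
    token, so a closed gate (near 0) leaves the RGB features unchanged.
    """
    concat = rgb_token + geo_token
    fused = []
    for c, (r, g) in enumerate(zip(rgb_token, geo_token)):
        logit = sum(w * x for w, x in zip(gate_weights[c], concat)) + gate_bias[c]
        fused.append(r + sigmoid(logit) * g)
    return fused


def project_joint(joint_xyz, focal, cx, cy):
    """Pinhole projection of a 3D joint (camera frame) to pixel coordinates."""
    x, y, z = joint_xyz
    return focal * x / z + cx, focal * y / z + cy


def bilinear_sample(feature_map, x, y):
    """Sample feature_map[row][col] (a list of channels) at pixel (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp the query to the map bounds
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ax) * (1 - ay) * feature_map[y0][x0][c]
        + ax * (1 - ay) * feature_map[y0][x1][c]
        + (1 - ax) * ay * feature_map[y1][x0][c]
        + ax * ay * feature_map[y1][x1][c]
        for c in range(len(feature_map[0][0]))
    ]
```

In an iterative refiner of this kind, each round would project the current joint estimates, sample the fused feature map at those locations, and regress a small correction from the sampled features. The regression step is omitted here because the abstract gives no details about it.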