🤖 AI Summary
To address limited pose and shape estimation accuracy and weak visual realism in monocular 3D hand reconstruction, this paper proposes a texture-aligned supervised learning framework. The core innovation lies in leveraging texture as a dense spatial prior: differentiable rendering enables pixel-level appearance alignment between the predicted mesh and the input RGB image. We design a lightweight texture module and a UV-space dense alignment loss, turning texture into a plug-and-play active supervision signal. Crucially, our method requires no additional annotations: geometry is optimized solely from RGB images. Integrated into mainstream frameworks such as HaMeR, it achieves significant improvements, including a 12.3% reduction in MPJPE and enhanced texture-geometry consistency. Extensive experiments demonstrate state-of-the-art performance in both visual realism and quantitative metrics across multiple benchmarks.
📝 Abstract
We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Our observation is simple: even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment may be an underused supervisory signal. We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a novel dense alignment loss between predicted and observed hand appearances. Our approach assumes access to a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, allowing us to back-project a textured hand onto the image and perform pixel-based alignment. The module is self-contained and easily pluggable into existing reconstruction pipelines. To isolate and highlight the value of texture-guided supervision, we augment HaMeR, a high-performing yet unadorned transformer architecture for 3D hand pose estimation. The resulting system improves both accuracy and realism, demonstrating the value of appearance-guided alignment in hand reconstruction.
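The dense alignment idea described above can be sketched as a simple masked photometric loss: a differentiable renderer produces an RGB image of the textured predicted mesh, and the loss compares it pixel-wise against the input image wherever the hand is rendered. The sketch below assumes PyTorch; the function name `texture_alignment_loss` and the tensor layout are illustrative, not the paper's actual implementation, and the differentiable renderer producing `rendered_rgb` and `hand_mask` is left abstract.

```python
import torch


def texture_alignment_loss(rendered_rgb: torch.Tensor,
                           observed_rgb: torch.Tensor,
                           hand_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 photometric loss between rendered and observed appearance.

    Args (all hypothetical shapes for this sketch):
        rendered_rgb: (B, 3, H, W) image from a differentiable renderer,
            i.e. the predicted mesh back-projected with its UV texture.
        observed_rgb: (B, 3, H, W) input RGB image.
        hand_mask:    (B, 1, H, W) soft rendered-hand silhouette, so only
            pixels the predicted hand covers contribute to the loss.
    """
    # Per-pixel absolute color difference, averaged over RGB channels.
    diff = (rendered_rgb - observed_rgb).abs().mean(dim=1, keepdim=True)
    # Restrict supervision to the rendered hand region.
    masked = diff * hand_mask
    # Normalize by the (soft) number of hand pixels to keep the loss
    # scale independent of hand size in the image.
    return masked.sum() / hand_mask.sum().clamp(min=1.0)
```

Because every step is differentiable, gradients flow through the renderer back into the mesh (pose and shape) parameters, which is what lets appearance alignment actively refine geometry rather than just texture.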