🤖 AI Summary
Monocular hand-held object reconstruction faces several challenges: shape over-smoothing from implicit representations, the inefficiency of explicit reconstruction methods, and slow high-resolution inference caused by multi-step denoising in diffusion models. This paper proposes the first end-to-end Transformer-based point cloud reconstruction framework for this task, adopting a coarse-to-fine strategy: it first generates a sparse point cloud, then progressively refines it using pixel-aligned image features. A novel hand-object joint geometric encoding mechanism jointly estimates object shape and 6D pose. By abandoning implicit representations and iterative diffusion, the method enables efficient, explicit point cloud generation. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art accuracy, significantly faster inference, and strong generalization to in-the-wild images.
📝 Abstract
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and make generating explicit 3D shapes time-consuming. While more recent methods reconstruct point clouds directly with diffusion models, their multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model that efficiently reconstructs dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy: it first generates a sparse point cloud from the image and then progressively refines it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
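The abstract's coarse-to-fine step — projecting sparse points into the image, gathering pixel-aligned features, and predicting per-point refinements — can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function names, the nearest-neighbour sampling, and the random matrix `W_offset` standing in for a trained decoder head are all hypothetical assumptions for illustration.

```python
import numpy as np

def project_points(points, K):
    """Project 3D points (N, 3) to pixel coordinates with pinhole intrinsics K (3, 3)."""
    uv = (K @ points.T).T          # homogeneous pixel coords, (N, 3)
    return uv[:, :2] / uv[:, 2:3]  # perspective divide -> (N, 2)

def sample_pixel_aligned(feat_map, uv):
    """Nearest-neighbour lookup of per-pixel features at the projected locations."""
    H, W, C = feat_map.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[v, u]          # (N, C)

def refine(coarse, feats, W_offset, k=4):
    """Split each coarse point into k refined points via feature-conditioned
    offsets (W_offset is a stand-in for a learned refinement head)."""
    offsets = (feats @ W_offset).reshape(len(coarse), k, 3)  # (N, k, 3)
    return (coarse[:, None, :] + offsets).reshape(-1, 3)     # (N*k, 3)

rng = np.random.default_rng(0)
K = np.array([[500., 0., 64.], [0., 500., 64.], [0., 0., 1.]])
coarse = rng.normal([0., 0., 2.], 0.1, size=(32, 3))   # sparse cloud in front of camera
feat_map = rng.normal(size=(128, 128, 16))             # stand-in for a CNN/ViT feature map
W_offset = rng.normal(size=(16, 4 * 3)) * 0.01         # hypothetical decoder weights

uv = project_points(coarse, K)
feats = sample_pixel_aligned(feat_map, uv)
dense = refine(coarse, feats, W_offset)
print(dense.shape)  # (128, 3): 32 coarse points densified to 128
```

In the paper this refinement is applied progressively and the offsets come from a Transformer conditioned on hand geometry as well; the sketch only shows the pixel-aligned sampling idea.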