Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficient 3D reconstruction of dynamic hand–object interactions from monocular video by proposing a novel approach based on a compact Sum-of-Gaussians (SoG) representation. Reviving and enhancing the classical Gaussian tracking framework, the method leverages a SAM3D video adaptation pipeline to initialize object geometry and pose, and refines hand motion through 2D joint supervision and depth alignment losses. Guided by a pretrained large model, the lightweight SoG formulation circumvents computationally intensive neural optimization. On public benchmarks, the proposed approach achieves a 6.4× speedup over state-of-the-art methods while improving object reconstruction accuracy by 13.4% and reducing hand joint localization error by over 65%.
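The summary names two hand refinement terms, 2D joint supervision and depth alignment, without giving their exact form. The sketch below is one plausible reading, not the authors' implementation: it assumes a pinhole camera, a squared reprojection error against detected 2D joints, and an L1 penalty between joint depths and a per-joint target depth. The function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the `target_depth` input are all hypothetical.

```python
import numpy as np

def project_pinhole(points_3d, fx, fy, cx, cy):
    """Project (J, 3) camera-space points to (J, 2) pixel coordinates."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def joint_2d_loss(joints_3d, joints_2d, intrinsics):
    """Mean squared pixel distance between projected and detected joints (assumed form)."""
    proj = project_pinhole(joints_3d, *intrinsics)
    return float(np.mean(np.sum((proj - joints_2d) ** 2, axis=1)))

def depth_alignment_loss(joints_3d, target_depth):
    """Mean absolute gap between joint depths and per-joint target depths (assumed form)."""
    return float(np.mean(np.abs(joints_3d[:, 2] - target_depth)))

if __name__ == "__main__":
    # Toy example: two joints in front of the camera, hypothetical intrinsics.
    joints_3d = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 2.0]])
    intrinsics = (100.0, 100.0, 50.0, 50.0)
    detected = project_pinhole(joints_3d, *intrinsics) + np.array([3.0, 4.0])
    print(joint_2d_loss(joints_3d, detected, intrinsics))
    print(depth_alignment_loss(joints_3d, np.array([1.5, 2.5])))
```

In a tracker, terms like these would be weighted, summed, and minimized over the hand pose parameters for each frame; the weighting and optimizer are not specified in the text above.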

📝 Abstract

We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from an off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4× faster than prior work while improving object reconstruction by 13.4% and reducing the hand's per-joint position error by over 65%.
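The abstract says the dense Gaussian representation is converted into a lightweight SoG "via subsampling" but does not state which subsampling scheme is used. As an illustration only, the sketch below uses farthest point sampling over the Gaussian centers and collapses each kept Gaussian to an isotropic one by averaging its per-axis scale. Both choices are assumptions, as are the function names and the `(N, 3)` array layout for `means` and `scales`.

```python
import numpy as np

def farthest_point_subsample(means, k):
    """Return k indices into an (N, 3) array of centers, chosen to be spread out."""
    n = means.shape[0]
    k = min(k, n)
    chosen = np.zeros(k, dtype=np.int64)
    # Seed with the center nearest the centroid so the result is deterministic.
    chosen[0] = np.argmin(np.linalg.norm(means - means.mean(axis=0), axis=1))
    # dist[i] tracks the distance from point i to its nearest chosen center.
    dist = np.linalg.norm(means - means[chosen[0]], axis=1)
    for i in range(1, k):
        chosen[i] = np.argmax(dist)
        new_dist = np.linalg.norm(means - means[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen

def to_sum_of_gaussians(means, scales, k):
    """Reduce dense anisotropic Gaussians to k isotropic ones (assumed conversion)."""
    idx = farthest_point_subsample(means, k)
    sog_means = means[idx]
    # Isotropic standard deviation: mean of the three per-axis scales.
    sog_sigmas = scales[idx].mean(axis=1)
    return sog_means, sog_sigmas

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dense_means = rng.normal(size=(5000, 3))             # stand-in for dense Gaussian centers
    dense_scales = rng.uniform(0.01, 0.05, (5000, 3))    # stand-in for per-axis scales
    sog_means, sog_sigmas = to_sum_of_gaussians(dense_means, dense_scales, k=64)
    print(sog_means.shape, sog_sigmas.shape)
```

Farthest point sampling is chosen here because it keeps coverage of the object surface even at small `k`; uniform random subsampling would be cheaper but can leave gaps in sparsely sampled regions.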

Problem

Research questions and friction points this paper is trying to address.

monocular reconstruction

hand-object interaction

dynamic 3D reconstruction

temporal coherence

single-view video

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sum-of-Gaussians

monocular hand-object reconstruction

dynamic 3D interaction

efficient tracking

Gaussian-based initialization
