EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

📅 2026-01-03
🏛️ arXiv.org
🤖 AI Summary
Existing methods struggle to accurately reconstruct world-space hand-object interactions (W-HOI) from in-the-wild monocular egocentric videos, primarily because of insufficient temporal modeling, a lack of global consistency, and limited robustness to severe camera motion and occlusions. This work proposes a multi-stage framework that first preprocesses the input video using spatial intelligence models, then applies a template-free, scalable decoupled diffusion model to learn a whole-body hand-object interaction prior. The trajectories of the hands and of multiple objects in world coordinates are then recovered jointly through multi-objective test-time optimization. According to the authors, this is the first approach to reconstruct W-HOI from monocular egocentric videos with dynamic cameras in unconstrained environments. The authors report state-of-the-art W-HOI reconstruction performance, targeting in particular the severe camera motion and frequent occlusions typical of egocentric footage.
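
As a rough illustration of what test-time optimization over several objectives can look like, the sketch below refines a noisy world-space trajectory by gradient descent on a weighted sum of two terms. The two terms (a data term on visible frames and a smoothness term), the weights, and the names `refine_trajectory` and `objective` are all assumptions made for this example; the summary does not state which objectives EgoGrasp actually uses.

```python
import numpy as np

def objective(x, observed, visible, w_data=1.0, w_smooth=1.0):
    """Weighted sum of a data term and a smoothness term (illustrative only)."""
    data = np.sum(visible[:, None] * (x - observed) ** 2)
    smooth = np.sum((x[1:] - x[:-1]) ** 2)
    return w_data * data + w_smooth * smooth

def refine_trajectory(observed, visible, w_data=1.0, w_smooth=1.0,
                      lr=0.05, steps=500):
    """Plain gradient descent on `objective`.

    observed: (T, 3) noisy world-space positions.
    visible:  (T,) boolean mask, False where the target is occluded.
    """
    x = observed.copy()
    mask = visible[:, None].astype(float)
    for _ in range(steps):
        # Data term: stay close to the observations on visible frames only.
        grad = 2.0 * w_data * mask * (x - observed)
        # Smoothness term: penalize squared frame-to-frame displacement.
        diff = x[1:] - x[:-1]
        grad[1:] += 2.0 * w_smooth * diff
        grad[:-1] -= 2.0 * w_smooth * diff
        x -= lr * grad
    return x

# Toy data (hypothetical): a straight-line path with noise and an occluded gap.
rng = np.random.default_rng(0)
T = 20
truth = np.stack([np.linspace(0.0, 1.0, T), np.zeros(T), np.zeros(T)], axis=1)
observed = truth + 0.05 * rng.normal(size=truth.shape)
visible = np.ones(T, dtype=bool)
visible[8:12] = False
observed[8:12] = 0.0  # placeholder values on occluded frames

refined = refine_trajectory(observed, visible)
```

Because occluded frames contribute nothing to the data term, the smoothness term alone pulls them toward the neighboring visible frames. A learned interaction prior would play a richer version of that role.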

📝 Abstract
We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and for enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, and so fail to model temporal dynamics or globally consistent trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust preprocessing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. Experiments show that our method achieves state-of-the-art performance in W-HOI reconstruction.
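
To make the distinction between camera coordinates and world space concrete, here is a minimal sketch of the standard rigid transform that lifts per-frame camera-space points into a shared world frame. It assumes per-frame camera-to-world poses are already available; the function name `camera_to_world` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def camera_to_world(points_cam, R_wc, t_wc):
    """Map per-frame camera-space points to world space.

    points_cam: (T, N, 3) points expressed in each frame's camera coordinates.
    R_wc:       (T, 3, 3) camera-to-world rotation per frame.
    t_wc:       (T, 3) camera position in world coordinates per frame.
    """
    # p_world = R_wc @ p_cam + t_wc, applied independently to every frame.
    return np.einsum("tij,tnj->tni", R_wc, points_cam) + t_wc[:, None, :]

# Toy example: a point held 0.5 m in front of a head-mounted camera
# while the camera translates 1 m along the world x-axis.
T = 3
points_cam = np.tile(np.array([[[0.0, 0.0, 0.5]]]), (T, 1, 1))  # constant in camera space
R_wc = np.tile(np.eye(3), (T, 1, 1))                            # no rotation
t_wc = np.stack([np.linspace(0.0, 1.0, T), np.zeros(T), np.zeros(T)], axis=1)

points_world = camera_to_world(points_cam, R_wc, t_wc)
```

In camera coordinates the point never moves, while in world space it travels with the wearer. This is why camera-space estimates alone cannot give consistent global trajectories when the camera itself is moving.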
Problem

Research questions and friction points this paper is trying to address.

world-space hand-object interaction
egocentric video
monocular reconstruction
camera motion
occlusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

world-space hand-object interaction
egocentric video
decoupled diffusion model
template-free HOI prior
test-time optimization