🤖 AI Summary
Existing methods struggle to accurately reconstruct world-coordinate hand–object interactions (W-HOI) from in-the-wild monocular egocentric videos, primarily due to insufficient temporal modeling, lack of global consistency, and limited robustness to severe camera motion and occlusions. This work proposes a multi-stage framework that first preprocesses the input video using spatial intelligence models, then applies a template-free decoupled diffusion model, scalable to multiple objects, as a whole-body hand–object interaction prior. Trajectories of the hands and multiple objects in world coordinates are then jointly recovered through multi-objective test-time optimization. The method is presented as the first to achieve high-fidelity reconstruction of multi-object W-HOI from monocular egocentric videos in unconstrained environments. It significantly outperforms existing approaches on standard benchmarks and is particularly robust under complex camera motion and frequent occlusions.
📝 Abstract
We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates and fail to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust preprocessing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. Experiments show that our method achieves state-of-the-art performance in W-HOI reconstruction.