Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

📅 2025-12-08
🤖 AI Summary
This paper addresses 3D object reconstruction from egocentric (first-person) video during hand–object interactions, without 3D ground truth. It introduces the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). A Hand Interaction Timeline (HIT) is defined from a rigid object's perspective: the object is static relative to the scene, is picked up and changes pose in the hand, is held in a stable grasp during use, and is released to be static again. The authors focus on timelines with stable grasps, which lets them annotate and evaluate reconstruction using 2D projection error alone. To estimate temporally consistent object poses, they propose Constrained Optimisation and Propagation (COP), which jointly minimizes 2D projection errors, enforces continuous contact constraints, and incorporates static-object priors, then propagates the object's pose along the HIT. On HOT3D and EPIC-Kitchens, COP improves stable grasp reconstruction by 6.2–11.3% and HIT reconstruction by up to 24.5%.

📝 Abstract
We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT, enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps, i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
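
The pose constraints described in the abstract can be illustrated with a small sketch. This is not the COP implementation: it only shows the kind of constraint a stable grasp suggests, under the assumption that the object stays rigidly fixed relative to the hand while the grasp holds. The function names (`propagate_stable_grasp`, `rigid_inverse`, `translation`) and the 4x4 nested-list pose representation are hypothetical choices made for this illustration.

```python
# Illustrative sketch only; names and pose representation are assumptions.
# Poses are 4x4 rigid transforms (frame-to-world) stored as nested lists.

def matmul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def propagate_stable_grasp(obj_pose_ref, hand_pose_ref, hand_poses):
    """Propagate a world-frame object pose through a stable-grasp segment.

    Assumes the hand-to-object transform measured at the reference frame
    stays constant for every frame in the segment.
    """
    hand_to_obj = matmul(rigid_inverse(hand_pose_ref), obj_pose_ref)
    return [matmul(hand_pose, hand_to_obj) for hand_pose in hand_poses]

def translation(x, y, z):
    """Build a pure-translation pose (helper for the example)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

# Example: the object sits 0.1 units above the hand at the reference frame;
# the hand then moves along x and the object pose follows it.
hand_ref = translation(1.0, 0.0, 0.0)
obj_ref = translation(1.0, 0.0, 0.1)
propagated = propagate_stable_grasp(obj_ref, hand_ref,
                                    [translation(2.0, 0.0, 0.0)])
```

During the static parts of a timeline the analogous constraint is simpler: the object's world pose is held constant, so a single estimate can be reused for every frame in that segment.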

Problem

Research questions and friction points this paper is trying to address.

Reconstructing object poses along hand interaction timelines in egocentric videos
Modeling pose constraints during stable grasps without 3D ground truth
Improving reconstruction accuracy using constrained optimization and propagation
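
Since evaluation relies on 2D projection error rather than 3D ground truth, the following sketch shows one plausible form such an error could take: the mean pixel distance between projected 3D points and their 2D annotations under a pinhole camera. The exact metric used in the paper is not specified in this summary, so the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the point-based formulation are all assumptions for illustration.

```python
import math

# Illustrative sketch only; the paper's actual 2D metric may differ.

def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates with a pinhole model."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def mean_projection_error(points_cam, observed_px, fx, fy, cx, cy):
    """Mean pixel distance between projected points and 2D observations."""
    errors = []
    for point, (u_obs, v_obs) in zip(points_cam, observed_px):
        u, v = project_point(point, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)

# Example with two points: the first projects exactly onto its annotation,
# the second is off by (3, 4) pixels.
points = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]
observed = [(50.0, 50.0), (103.0, 54.0)]
error = mean_projection_error(points, observed, 100.0, 100.0, 50.0, 50.0)
```

An error of this kind needs only 2D annotations and camera intrinsics, which is what makes evaluation possible on in-the-wild video where no 3D ground truth exists.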

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained Optimisation and Propagation framework for reconstruction
Modeling pose constraints along Hand Interaction Timeline
Focusing on stable grasp timelines for annotation efficiency