🤖 AI Summary
Detecting the precise moment of hand–object contact in first-person videos is challenging because of subtle motions and frequent occlusions. This work proposes the Hand-informed Context Enhanced (HiCE) module, which integrates spatiotemporal features from hand regions and their surrounding context through a cross-attention mechanism to model latent contact patterns. To further refine temporal discrimination, the authors introduce a grasp-aware loss function and a soft-label training strategy. Evaluated on TouchMoment, a newly curated large-scale dataset comprising 8,456 annotated contact moments, the proposed method achieves a 16.91% improvement in average precision over existing event localization approaches under a two-frame tolerance metric.
📝 Abstract
We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives.
To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced "high-see") that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft labels that emphasize hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
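To make the soft-label idea and the two-frame tolerance criterion concrete, the following is a minimal sketch and not the paper's implementation. It assumes a Gaussian decay around each annotated contact frame for the soft labels (the abstract does not specify the shape) and a greedy one-to-one matching of predictions to ground-truth moments. The function names `soft_labels` and `match_within_tolerance` and the parameter `sigma` are hypothetical.

```python
import math


def soft_labels(num_frames, contact_frames, sigma=1.0):
    """Per-frame targets in [0, 1].

    Each annotated contact frame gets 1.0 and neighbouring frames get a
    smaller value. The Gaussian shape and `sigma` are assumptions made for
    illustration only.
    """
    labels = [0.0] * num_frames
    for t in range(num_frames):
        for c in contact_frames:
            weight = math.exp(-((t - c) ** 2) / (2 * sigma ** 2))
            labels[t] = max(labels[t], weight)
    return labels


def match_within_tolerance(pred_frames, gt_frames, tolerance=2):
    """Count true positives, false positives, and false negatives.

    `pred_frames` is assumed to be ordered by descending confidence.
    A prediction is a true positive only if it lies within `tolerance`
    frames of a ground-truth moment that has not been matched yet.
    """
    unmatched = sorted(gt_frames)
    true_positives = 0
    for p in pred_frames:
        candidates = [g for g in unmatched if abs(g - p) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda g: abs(g - p))
            unmatched.remove(nearest)
            true_positives += 1
    false_positives = len(pred_frames) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives
```

Average precision would then be obtained by sweeping a confidence threshold over the predictions and integrating the resulting precision-recall curve. With a tolerance of two frames, a prediction three frames away from the annotated moment already counts as an error, which is what makes the criterion strict.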