HOIGaze: Gaze Estimation During Hand-Object Interactions in Extended Reality Exploiting Eye-Hand-Head Coordination

πŸ“… 2025-04-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In XR-based hand-object interaction (HOI) scenarios, gaze estimation accuracy suffers because eye, hand, and head motion are tightly coupled. To address this, we propose a coordination-aware gaze estimation framework: (1) a novel eye-hand-head motion coordination metric that identifies high-value training samples, effectively denoising the training data; (2) a hierarchical architecture that first recognises the visually attended hand and then regresses gaze direction conditioned on it; and (3) an eye-head coordination loss that up-weights samples with coordinated eye-head movements, trained jointly with a gaze estimator that uses cross-modal Transformers to fuse head features (from a CNN) with hand-object features (from a spatio-temporal graph convolutional network, ST-GCN). Evaluated on the HOT3D and ADT benchmarks, the method reduces mean angular error by 15.6% and 6.0%, respectively, and also improves the downstream task of eye-based activity recognition. This work is the first to explicitly model motion coordination for HOI gaze estimation, opening a new direction for XR human-computer interaction.
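The coordination metric itself is not spelled out in this summary; as a hedged illustration, one simple way to score how coordinated the eye, hand, and head are for a given sample is the mean pairwise cosine similarity of their unit direction vectors, mapped to [0, 1] so it can serve as a sample weight. The function name and formulation below are assumptions for illustration, not the paper's actual metric.

```python
import numpy as np

def coordination_weight(head_dir, hand_dir, eye_dir):
    """Hypothetical coordination score: mean pairwise cosine similarity
    between eye, head, and hand direction vectors, mapped to [0, 1].
    A value near 1 indicates highly coordinated motion."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cos(eye_dir, head_dir),
              cos(eye_dir, hand_dir),
              cos(head_dir, hand_dir)]
    # Cosine similarity lies in [-1, 1]; rescale to [0, 1] for use as a weight.
    return (float(np.mean(scores)) + 1.0) / 2.0

# Nearly aligned eye/head/hand directions yield a weight close to 1.
w = coordination_weight(np.array([0.0, 0.0, 1.0]),
                        np.array([0.05, 0.0, 1.0]),
                        np.array([0.0, 0.05, 1.0]))
```

Samples scoring low under such a metric could be down-weighted or discarded during training, which is the denoising idea the summary describes.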

πŸ“ Abstract
We present HOIGaze - a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: The eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training - as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.
Problem

Research questions and friction points this paper is trying to address.

Estimating gaze during hand-object interactions in XR
Exploiting eye-hand-head coordination for gaze estimation
Improving gaze estimation accuracy via denoised training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework for attended hand recognition
Cross-modal Transformers fuse head and hand features
Eye-head coordination loss improves training samples
Zhiming Hu
University of Stuttgart, Germany
Daniel Haeufle
University of TΓΌbingen, Germany
Syn Schmitt
University of Stuttgart, Germany and The Center for Bionic Intelligence Tuebingen Stuttgart, Germany
Andreas Bulling
Professor of Computer Science, University of Stuttgart
Human-Computer Interaction Β· Computer Vision Β· Machine Learning Β· Collaborative AI Β· Eye Tracking