🤖 AI Summary
This work addresses the novel task of gaze-guided hand-object interaction (HOI) synthesis, where the central challenge is to resolve the ambiguity of sparse, noisy gaze signals while keeping the generated motion natural. To this end, we introduce GazeHOI, the first dataset to capture gaze, hand, and object interactions simultaneously in 3D. We propose GHO-Diffusion, a stacked diffusion model that applies HOI-Manifold Guidance during sampling and selects among generated results using consistency scores between gaze-contact maps and gaze-interaction trajectories; we also design a spatial-temporal gaze feature encoding as the diffusion condition to improve controllability and physical plausibility. Experiments on GazeHOI show that our method outperforms all compared baselines, producing motions that are temporally coherent, physically feasible, and consistent across gaze, hand, and object. This provides a foundation for interaction synthesis in AR/VR and assistive technologies.
📝 Abstract
Gaze plays a crucial role in revealing human attention and intention, particularly in hand-object interaction scenarios, where it guides and synchronizes complex tasks that require precise coordination between the brain, hand, and object. Motivated by this, we introduce a novel task: Gaze-Guided Hand-Object Interaction Synthesis, with potential applications in augmented reality, virtual reality, and assistive technologies. To support this task, we present GazeHOI, the first dataset to capture gaze, hand, and object interactions simultaneously in 3D. The task is challenging because gaze data is inherently sparse and noisy, and because the generated hand and object motions must be highly consistent and physically plausible. To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. The stacked design effectively reduces the complexity of motion generation. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions while keeping them on the data manifold. Additionally, we propose a spatial-temporal gaze feature encoding for the diffusion condition and select diffusion results based on consistency scores between gaze-contact maps and gaze-interaction trajectories. Extensive experiments highlight the effectiveness of our method and the unique contributions of our dataset. More details are available at https://takiee.github.io/gaze-hoi/.
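The final step described in the abstract, selecting among diffusion results by a consistency score, can be illustrated with a small best-of-N sketch. The abstract does not define the scores, so everything here is an assumption made for illustration: the function names `gaze_contact_score` and `select_best` are hypothetical, and the score used (negative mean distance from each gaze point to its nearest predicted contact point) is a stand-in, not the paper's metric.

```python
import math


def gaze_contact_score(gaze_points, contact_points):
    """Hypothetical consistency score (an assumption, not the paper's metric).

    gaze_points and contact_points are lists of (x, y, z) tuples.
    Returns the negative mean distance from each gaze point to its
    nearest contact point, so a higher value means better agreement.
    """
    nearest = [min(math.dist(g, c) for c in contact_points) for g in gaze_points]
    return -sum(nearest) / len(nearest)


def select_best(candidates, gaze_points):
    """Score every sampled candidate and return (best_index, scores)."""
    scores = [gaze_contact_score(gaze_points, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data: one gaze point and two sampled candidates' contact points.
gaze = [(0.0, 0.0, 0.0)]
samples = [[(1.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)]]
best, scores = select_best(samples, gaze)
print(best, scores)
```

In this toy setup the second candidate is chosen because its contact point lies closer to the gaze point. A real pipeline would presumably combine more than one score, for example the gaze-contact and gaze-interaction terms named in the abstract, but how they are weighted is not stated here.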