🤖 AI Summary
This work addresses the challenging problem of estimating 360-degree panoramic gaze targets in realistic human-computer interaction scenarios. We propose the first end-to-end, single-image gaze target localization method. To overcome the limitations of prior approaches, such as the inability to handle out-of-frame gaze and insufficient modeling of environmental context, we design a conditional reasoning architecture that integrates an eye-contact detector, a pre-trained vision encoder, and a multi-scale decoder to enable cross-modal feature alignment and scene-aware inference. The method supports robust zero-shot generalization to unseen environments without requiring scene priors or auxiliary sensors. Extensive experiments across diverse real-world scenes demonstrate significant improvements in gaze point localization accuracy, with an average 12.7% gain over state-of-the-art methods, while maintaining strong generalizability and real-time deployability. This advances foundational capabilities for downstream tasks including attention modeling and intent prediction.
📝 Abstract
Enabling robots to understand human gaze targets is a crucial step toward downstream capabilities such as attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively incorporate background information from images and cannot predict gaze targets when subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from a single image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines: an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross-validation results show that GazeTarget360 produces accurate and reliable gaze target predictions in unseen scenarios. This makes it a first-of-its-kind system for predicting gaze targets from realistic camera footage that is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.
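The conditional inference described above can be sketched as a two-stage pipeline: an eye-contact detector first decides whether the subject is looking at the camera (an out-of-frame target behind the lens); only otherwise are the vision encoder and multi-scale-fusion decoder invoked to regress an in-frame gaze heatmap. The sketch below is a minimal illustration of that control flow; all function names, arguments, and the dictionary output format are assumptions for illustration, not the actual API of the released code.

```python
# Hypothetical sketch of GazeTarget360-style conditional inference.
# The component interfaces (eye_contact_net, encoder, decoder) are
# illustrative assumptions, not the repository's real API.

def predict_gaze_target(image, face_box, eye_contact_net, encoder, decoder,
                        contact_threshold=0.5):
    """Two-stage conditional gaze target inference.

    Stage 1: the eye-contact detector scores whether the subject looks
    into the camera; if so, no in-frame target exists.
    Stage 2: otherwise, the pre-trained vision encoder extracts scene
    features and the multi-scale-fusion decoder produces a gaze heatmap.
    """
    contact_score = eye_contact_net(image, face_box)
    if contact_score > contact_threshold:
        # Subject makes eye contact: the gaze target is the camera itself.
        return {"eye_contact": True, "heatmap": None}
    features = encoder(image)              # scene-aware visual features
    heatmap = decoder(features, face_box)  # fused multi-scale gaze heatmap
    return {"eye_contact": False, "heatmap": heatmap}
```

Gating the heavier encoder-decoder path behind a cheap eye-contact check is one plausible reason the system can remain efficient enough for real-time deployment.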