🤖 AI Summary
Existing gaze target detection methods for multi-person scenarios suffer from fragmented module design and weak modeling of head–target associations. To address these issues, this paper proposes an end-to-end multi-instance gaze target detection framework. Its key contributions are: (1) an explicit head–target connection model that directly predicts structured associations between heads and gaze targets; (2) a scene semantic feature extraction module driven by a pre-trained diffusion model, coupled with head feature re-injection to strengthen head priors; and (3) a unified, learnable paradigm that jointly optimizes connection prediction and detection. Evaluated on the GazeFollow and MEGA benchmarks, the method significantly outperforms state-of-the-art approaches, particularly under challenging conditions involving multiple persons, occlusions, and complex backgrounds, with substantial improvements in localization accuracy and robustness. This work advances visual intention understanding for natural human–computer interaction.
📝 Abstract
Precisely detecting which object a person is paying attention to is critical for human-robot interaction, since it provides important cues about the human user's next action. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most existing methods rely on independent components such as off-the-shelf head detectors, or struggle to establish associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on an input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual association between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
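The abstract describes a connection map that explicitly associates each head with its gaze target. As a toy illustration of how such predicted maps could be decoded into head-target pairs, the sketch below greedily matches each head's heatmap peak to the target whose connection map responds most strongly at both peak locations. All function names and the heatmap representation here are hypothetical; the paper's model learns these associations end to end, and this is not its actual implementation.

```python
def argmax_2d(heatmap):
    """Return (row, col) of the peak value in a 2D heatmap (list of lists)."""
    best, pos = float("-inf"), (0, 0)
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best:
                best, pos = v, (r, c)
    return pos

def associate(head_maps, target_maps, conn_maps):
    """Toy decoding: match head i to the target j whose connection map
    has the strongest response at both the head peak and the target peak."""
    pairs = []
    for i, h in enumerate(head_maps):
        hy, hx = argmax_2d(h)
        best_j, best_score = -1, float("-inf")
        for j, (t, conn) in enumerate(zip(target_maps, conn_maps)):
            ty, tx = argmax_2d(t)
            score = conn[hy][hx] + conn[ty][tx]
            if score > best_score:
                best_j, best_score = j, score
        pairs.append((i, best_j))
    return pairs
```

In this simplified view, a connection map that "lights up" at both a head location and a target location encodes their association directly, which is what lets the framework avoid a separate, hand-crafted matching stage.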