GazeHTA: End-to-end Gaze Target Detection with Head-Target Association

📅 2024-04-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing gaze target detection methods for multi-person scenes rely on fragmented module designs (e.g., off-the-shelf head detectors) and model head-target associations only weakly. This paper proposes GazeHTA, an end-to-end multi-person gaze target detection framework that predicts multiple head-target instances from a single scene image. Its key contributions are: (1) an explicit connection map that predicts the visual associations between heads and their gaze targets; (2) scene feature extraction driven by a pre-trained diffusion model, combined with head feature re-injection to strengthen head priors; and (3) joint end-to-end learning of head detection, gaze target localization, and their association. On two standard benchmarks, GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines, with clear gains in localization accuracy and robustness in multi-person scenes. This work advances visual intention understanding for natural human-robot interaction.

📝 Abstract
Precisely detecting which object a person is paying attention to is critical for human-robot interaction since it provides important cues for the next action from the human user. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on an input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
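To make the association idea concrete, here is a minimal NumPy sketch of how a learned connection map could link a head heatmap to one of several candidate target heatmaps at inference time. This is an illustrative toy, not the paper's implementation: the function names, the peak-based matching, and the scoring rule are all assumptions for the example.

```python
import numpy as np

def peak(hm):
    """(row, col) of a heatmap's maximum activation."""
    return np.unravel_index(np.argmax(hm), hm.shape)

def associate(head_hms, target_hms, conn_maps):
    """Pair each head with a gaze target (hypothetical stand-in for
    GazeHTA's learned head-target association).

    head_hms  : per-instance head heatmaps
    target_hms: candidate gaze-target heatmaps
    conn_maps : per-instance connection maps; high values mark the
                visual link between a head and its target
    """
    pairs = []
    for head_hm, conn in zip(head_hms, conn_maps):
        # Score each candidate target by the connection-map
        # activation at that target's peak location.
        scores = [conn[peak(t)] for t in target_hms]
        best = int(np.argmax(scores))
        pairs.append((peak(head_hm), peak(target_hms[best])))
    return pairs
```

In the real model these three heatmaps are predicted jointly from diffusion-model scene features, so the connection map is supervised as an explicit association target rather than derived post hoc.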
Problem

Research questions and friction points this paper is trying to address.

Detecting gaze targets accurately
Enhancing head-target association
Improving human-robot interaction cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end gaze target detection
Pre-trained diffusion model
Head-target connection map