GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing gaze understanding methods decouple gaze object detection from gaze following prediction and rely heavily on head pose priors, requiring auxiliary networks to extract head information—hindering end-to-end joint optimization and limiting deployment flexibility. To address this, we propose GaTector+, the first unified framework that operates without head pose input during inference. It employs a shared backbone, a dedicated head-detection branch, and a head-based attention fusion mechanism to enable synergistic optimization of both tasks. We further introduce an attention supervision strategy and a novel evaluation metric, mean Similarity over Candidates (mSoC), to enhance localization accuracy and robustness. Extensive experiments demonstrate that GaTector+ significantly outperforms state-of-the-art methods across multiple benchmarks, validating the effectiveness and practicality of the head-pose-free design.

📝 Abstract
Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior and intent. However, most previous methods solve these two tasks separately, and their predictions typically depend on head-related prior knowledge during both training and real-world deployment. This dependency necessitates an auxiliary network to extract the head location, precluding joint optimization of the entire system and constraining practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following that eliminates the dependence on head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor: a shared backbone extracts general features for gaze following and object detection, while task-specific blocks before and after the backbone account for the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch that predicts the head of each person. Then, before regressing the gaze point, a head-based attention mechanism fuses the scene feature and the gaze feature with the help of the predicted head location. Since suboptimal learning of the gaze point heatmap creates a performance bottleneck, we propose an attention supervision mechanism to accelerate heatmap learning. Finally, we propose a novel evaluation metric for gaze object detection, mean Similarity over Candidates (mSoC), which is more sensitive to variations between bounding boxes. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our model on both gaze object detection and gaze following.
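The abstract describes a head-based attention mechanism that fuses the scene feature and the gaze feature using the predicted head location. The paper's exact formulation is not given on this page, so the sketch below is only a minimal illustration of the idea: a Gaussian attention map centred on the detected head box re-weights the gaze feature before fusion. All function names, shapes, and the Gaussian prior are assumptions, not the authors' implementation.

```python
import numpy as np

def head_attention_fuse(scene_feat, gaze_feat, head_box, sigma=0.15):
    """Illustrative fusion of scene and gaze feature maps (hypothetical).

    scene_feat, gaze_feat: (C, H, W) feature maps
    head_box: (x1, y1, x2, y2) in normalised [0, 1] image coordinates
    sigma: spread of the Gaussian attention prior (assumed hyperparameter)
    """
    C, H, W = scene_feat.shape
    x1, y1, x2, y2 = head_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2  # centre of the detected head

    # Gaussian attention map peaked at the head centre
    ys = (np.arange(H) + 0.5) / H
    xs = (np.arange(W) + 0.5) / W
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    attn = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))

    # Re-weight the gaze feature by the head attention, then fuse additively
    fused = scene_feat + attn[None, :, :] * gaze_feat
    return fused, attn
```

In the actual model, such an attention map would be produced by learned layers and trained with the attention supervision the abstract mentions; the fixed Gaussian here only shows how a head location can modulate feature fusion.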
Problem

Research questions and friction points this paper is trying to address.

Prior methods depend on head pose priors at inference, blocking end-to-end optimization
Gaze object detection and gaze following are solved separately rather than jointly
Existing evaluation metrics are insensitive to variations between bounding boxes
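The abstract motivates mSoC as being "more sensitive to variations between bounding boxes" than standard box metrics. The paper's actual mSoC formula is not stated on this page, so the sketch below does not reproduce it; it only illustrates the underlying problem with a hypothetical smooth similarity: plain IoU scores every non-overlapping prediction as 0, while a distance-based similarity still distinguishes a near miss from a far one.

```python
import numpy as np

def iou(a, b):
    # Standard intersection-over-union for boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def soft_similarity(a, b, sigma=0.5):
    """Hypothetical smooth box similarity: decays with centre distance
    normalised by the reference box size, so non-overlapping boxes still
    receive graded scores. NOT the paper's mSoC definition."""
    ca = np.array([(a[0] + a[2]) / 2, (a[1] + a[3]) / 2])
    cb = np.array([(b[0] + b[2]) / 2, (b[1] + b[3]) / 2])
    scale = np.sqrt((a[2] - a[0]) * (a[3] - a[1]))
    return float(np.exp(-np.linalg.norm(ca - cb) / (sigma * scale)))
```

For a ground-truth box (0, 0, 2, 2), both (2.1, 0, 4.1, 2) and (8, 0, 10, 2) get IoU 0, but the soft similarity ranks the first far higher, which is the kind of graded sensitivity the abstract attributes to mSoC.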
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework eliminates head-related prior dependency
Expanded specific-general-specific feature extractor with shared backbone
Head-based attention mechanism fuses scene and gaze features