🤖 AI Summary
Existing end-to-end gaze target detection models employ a single decoder to jointly model head localization and gaze prediction, which entangles the two tasks' features and lets their objectives interfere. To address this, we propose a dual-decoder Transformer architecture that explicitly decouples head localization from gaze mapping: the head decoder relies on local attention to capture fine-grained head geometry, while the gaze decoder combines local and global attention to model gaze directionality and scene semantics. This design enforces separate feature representations and optimization objectives for the two subtasks. Our method achieves state-of-the-art performance on the GazeFollow, VideoAttentionTarget, and ChildPlay benchmarks, significantly outperforming prior end-to-end approaches. The improved robustness and interpretability enable more reliable quantification of gaze behavior in social contexts, advancing applications in human–computer interaction and digital phenotyping.
📝 Abstract
Gaze communication plays a crucial role in daily social interactions, and quantifying this behavior can benefit human-computer interaction and digital phenotyping. While end-to-end models exist for gaze target detection, they use only a single decoder to simultaneously localize human heads and predict their corresponding gaze (e.g., 2D points or a heatmap) in a scene. This multitask learning approach yields a unified, entangled representation for head localization and gaze location prediction. Herein, we propose GazeDETR, a novel end-to-end architecture with two disentangled decoders that individually learn unique representations and attend over coherent, task-appropriate fields for each subtask. More specifically, we demonstrate that its head predictor relies on local information, while its gaze decoder incorporates both local and global information. GazeDETR achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget, and ChildPlay datasets, outperforming existing end-to-end models by a notable margin.
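The local-versus-global split between the two decoders can be illustrated with a minimal sketch. This is a hypothetical toy example in NumPy, not the authors' implementation: it contrasts a head branch whose queries attend only within a small window of scene tokens against a gaze branch with unrestricted global attention; all shapes, window sizes, and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask entries that are True are blocked.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d, n_tokens, n_queries = 32, 16, 4
tokens = rng.standard_normal((n_tokens, d))    # encoded scene features (toy)
queries = rng.standard_normal((n_queries, d))  # decoder queries (toy)

# Head-style decoder: local attention, each query sees a small token window.
win = 3
idx = np.arange(n_tokens)
centers = np.linspace(0, n_tokens - 1, n_queries).astype(int)
local_mask = np.abs(idx[None, :] - centers[:, None]) > win  # (n_queries, n_tokens)
head_feats = attention(queries, tokens, tokens, mask=local_mask)

# Gaze-style decoder: unrestricted global attention over all scene tokens.
gaze_feats = attention(queries, tokens, tokens)
```

Because the two branches use different attentive fields over the same tokens, they produce distinct per-query representations, which is the disentanglement the paper attributes to its two decoders.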