🤖 AI Summary
Existing end-to-end gaze target detection models employ a single decoder to jointly model head localization and gaze prediction, which entangles the two tasks' features and lets their objectives interfere. To address this, we propose a dual-decoder Transformer architecture that explicitly decouples head localization from gaze mapping: the head decoder relies on local attention to capture fine-grained head geometry, while the gaze decoder combines local and global attention to model gaze directionality and scene semantics. This design enforces separate feature representations and optimization objectives for the two subtasks. Our method achieves state-of-the-art performance on the GazeFollow, VideoAttentionTarget, and ChildPlay benchmarks, significantly outperforming prior end-to-end approaches. The improved robustness and interpretability enable more reliable quantification of gaze behavior in social contexts, advancing applications in human–computer interaction and digital phenotyping.
📝 Abstract
Gaze communication plays a crucial role in daily social interactions, and quantifying this behavior can benefit human-computer interaction and digital phenotyping. While end-to-end models exist for gaze target detection, they use only a single decoder to simultaneously localize human heads and predict their corresponding gaze (e.g., 2D points or a heatmap) in a scene. This multitask learning approach yields a unified, entangled representation for head localization and gaze location prediction. Herein, we propose GazeDETR, a novel end-to-end architecture with two disentangled decoders that individually learn unique representations and attend over coherent, task-appropriate fields for each subtask. More specifically, we demonstrate that its head predictor relies on local information, while its gaze decoder incorporates both local and global information. GazeDETR achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget, and ChildPlay datasets, outperforming existing end-to-end models by a notable margin.
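The local-versus-global split between the two decoders can be illustrated with a minimal sketch. This is a hypothetical toy example in NumPy, not the authors' implementation: it contrasts a head branch whose queries attend only within a small window of scene tokens against a gaze branch with unrestricted global attention; all shapes, window sizes, and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask entries that are True are blocked.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d, n_tokens, n_queries = 32, 16, 4
tokens = rng.standard_normal((n_tokens, d))    # encoded scene features (toy)
queries = rng.standard_normal((n_queries, d))  # decoder queries (toy)

# Head-style decoder: local attention, each query sees a small token window.
win = 3
idx = np.arange(n_tokens)
centers = np.linspace(0, n_tokens - 1, n_queries).astype(int)
local_mask = np.abs(idx[None, :] - centers[:, None]) > win  # (n_queries, n_tokens)
head_feats = attention(queries, tokens, tokens, mask=local_mask)

# Gaze-style decoder: unrestricted global attention over all scene tokens.
gaze_feats = attention(queries, tokens, tokens)
```

Because the two branches use different attentive fields over the same tokens, they produce distinct per-query representations, which is the disentanglement the paper attributes to its two decoders.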