🤖 AI Summary
Existing gaze estimation methods suffer from interference caused by irrelevant factors—such as facial expressions, illumination variations, and occlusions—in complex face images. To address this, we propose a disentangled gaze estimation framework featuring three key innovations: (1) a novel continuous masking-based disentangler that reconstructs eye-region and non-eye-region features via dual parallel branches; (2) a cascaded multi-scale global–local attention module (MS-GLAM) that jointly models cross-scale global semantics and fine-grained eye-region details; and (3) explicit integration of head pose features, optimized through end-to-end joint training to enhance robustness. Our method achieves state-of-the-art performance on the MPIIGaze and EyeDiap benchmarks, with significant improvements in challenging scenarios—including low-resolution inputs, cross-domain settings, and partial occlusions—demonstrating the effectiveness of disentangled representation learning coupled with multi-scale attention modeling.
📝 Abstract
Gaze estimation, which predicts gaze direction, commonly faces interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects to improve overall performance: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from cropped eye patches), and head pose features. First, we design a new continuous-mask-based Disentangler that accurately separates gaze-relevant and gaze-irrelevant information in facial images, achieving the dual-branch disentanglement goal by separately reconstructing the eye and non-eye regions. Furthermore, we introduce a new cascaded attention module, the Multi-Scale Global-Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively attends to global and local information at multiple scales, further enhancing the features produced by the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.
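The continuous-mask split behind the dual-branch Disentangler can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the shapes are arbitrary and the mask `M` here is random, standing in for the learned continuous mask that, in DMAGaze, would route gaze-relevant features to the eye-region reconstruction branch and the remainder to the non-eye-region branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 2-channel 8x8 feature map F and a continuous
# mask M in [0, 1] (random here; learned by the Disentangler in practice).
F = rng.standard_normal((2, 8, 8))
M = rng.uniform(0.0, 1.0, size=(1, 8, 8))  # continuous, not binary

# Dual-branch split: gaze-relevant vs. gaze-irrelevant features.
F_gaze = M * F           # would feed the eye-region reconstruction branch
F_other = (1.0 - M) * F  # would feed the non-eye-region reconstruction branch

# Because the mask is complementary, the two branches sum back to the
# original features, so the split itself discards no information.
assert np.allclose(F_gaze + F_other, F)
```

The reconstruction losses on the eye and non-eye regions are what would push `M` toward isolating gaze-relevant content; that training signal is omitted in this sketch.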