🤖 AI Summary
To address real-time, robust gaze estimation under unconstrained conditions, this work proposes a spatiotemporally disentangled architecture based on capsule networks. It combines a ConvNeXt backbone with capsule formation and uses attention-based routing to model part-whole relationships explicitly; a dual-GRU decoder captures slow and rapid eye-movement dynamics separately, improving the interpretability and generalization of the temporal modeling. The method reports state-of-the-art mean angular error on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°), and matches or outperforms existing methods on Gaze360 (9.06°) and RT-GENE (4.76°), with sub-10 ms inference and fewer parameters than the methods it is compared against. Its core contribution is the combination of capsule representations, attention-based routing, and temporally disentangled GRUs for gaze estimation, which targets efficiency, robustness, and interpretability together.
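As a rough illustration of what attention-based routing between part capsules and a whole capsule can look like, the sketch below pools a set of part-capsule vectors into one whole-capsule vector using softmax attention weights. It is a simplified stand-in under stated assumptions, not the routing used in CapStARE: the function names, the single `query` vector, and the scaled dot-product scoring are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_route(part_capsules, query):
    """Pool part capsules into one whole capsule (hypothetical helper).

    Each part capsule is scored by its scaled dot product with `query`;
    the whole capsule is the attention-weighted sum of the parts.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    scores = [
        sum(p * q for p, q in zip(part, query)) / scale
        for part in part_capsules
    ]
    weights = softmax(scores)
    whole = [
        sum(w * part[d] for w, part in zip(weights, part_capsules))
        for d in range(dim)
    ]
    return whole, weights


# Toy input: three 2-D part capsules and an assumed query vector.
parts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
whole, weights = attention_route(parts, query)
```

The returned weights can be read directly as how much each part contributes to the whole, which is the kind of inspectable quantity that makes routing-based models easier to interpret.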
📝 Abstract
We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06°) and to human-robot interaction scenarios in RT-GENE (4.76°), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The code and results for this article are available at https://github.com/toukapy/capsStare
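The figures in parentheses are angular errors in degrees, the standard metric on these benchmarks: the angle between the predicted and ground-truth 3D gaze directions, averaged over the test samples. The sketch below shows one way to compute it for a single sample. The pitch/yaw-to-vector convention is an assumption (sign and axis conventions differ between datasets), and the function names are made up for this example.

```python
import math


def gaze_vector(pitch, yaw):
    """Convert pitch/yaw in radians to a 3D unit gaze vector.

    Assumed convention: the camera looks along -z, so zero pitch and yaw
    map to (0, 0, -1). Check the convention of the dataset in use.
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(
        sum(t * t for t in target)
    )
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_sim))


# A prediction whose yaw is off by 3 degrees from a straight-ahead target.
target = gaze_vector(0.0, 0.0)
pred = gaze_vector(0.0, math.radians(3.0))
error = angular_error_deg(pred, target)
```

A benchmark score such as 3.36° on ETH-XGaze would then correspond to the mean of this per-sample error over the evaluation set.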