CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation

📅 2025-09-24

🤖 AI Summary
This work targets real-time, robust gaze estimation under unconstrained conditions with a capsule-based architecture that separates spatial and temporal modeling. A ConvNeXt backbone feeds a capsule formation stage, and attention-based routing explicitly models part-whole relationships. Two GRU decoders then handle slow and rapid eye movement dynamics separately, which is intended to make the temporal model more interpretable and to help it generalize. The method reports state-of-the-art angular error on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°), and matches or outperforms existing methods on Gaze360 (9.06°) and RT-GENE (4.76°), with inference under 10 ms and fewer parameters than the methods it is compared against. The core contribution is the combination of capsule representations, attention-based routing, and temporally disentangled GRU decoders for gaze estimation, aimed at balancing efficiency, robustness, and interpretability.

📝 Abstract
We introduce CapStARE, a capsule-based spatiotemporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art angular error on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06°) and to human-robot interaction scenarios in RT-GENE (4.76°), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The code and results for this article are available at: https://github.com/toukapy/capsStare
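The benchmark figures above are angular errors in degrees. As a rough illustration of how such a number can be computed for one sample, the sketch below converts a predicted and a ground-truth gaze direction from (pitch, yaw) to unit vectors and measures the angle between them. The function names and the pitch/yaw-to-vector convention are assumptions made for this example; the abstract does not specify the evaluation code used by CapStARE.

```python
import math

def gaze_vector(pitch, yaw):
    """Convert pitch/yaw in radians to a unit 3D gaze direction.

    Assumption: one common camera-facing convention; datasets differ.
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )

def angular_error_deg(pred, target):
    """Angle in degrees between two gaze directions given as (pitch, yaw)."""
    a = gaze_vector(*pred)
    b = gaze_vector(*target)
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to guard against rounding slightly outside acos's domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))

# Usage: a prediction that is off by 3 degrees of yaw from the target.
error = angular_error_deg((0.0, math.radians(3.0)), (0.0, 0.0))
```

A dataset-level score would typically be the mean of this per-sample error over the test set.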
Problem

Research questions and friction points this paper is trying to address.

Developing robust real-time gaze estimation for interactive systems
Modeling both slow and rapid gaze dynamics efficiently
Achieving generalization across diverse unconstrained gaze conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capsule-based architecture with attention routing
Dual GRU decoders for slow and rapid dynamics
ConvNeXt backbone feeding capsule formation for efficient part-whole reasoning
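To make the idea of attention-based routing between capsules more concrete, here is a minimal single-head sketch in plain Python. It scores each part capsule against a query vector, normalizes the scores with a softmax, and pools the parts into one whole capsule that is then passed through the standard capsule squash nonlinearity. The `query` vector, the scaled dot-product scoring, and the use of squash are assumptions for illustration; the summary does not describe CapStARE's actual routing in this detail.

```python
import math

def squash(v):
    """Standard capsule squash: keeps direction, maps length into [0, 1)."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0 for _ in v]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_route(parts, query):
    """Pool part capsules into one whole capsule using attention weights.

    parts: list of equal-length vectors (part capsules).
    query: hypothetical learned vector for the whole capsule.
    """
    dim = len(query)
    scores = [
        sum(p_k * q_k for p_k, q_k in zip(p, query)) / math.sqrt(dim)
        for p in parts
    ]
    weights = softmax(scores)
    pooled = [sum(w * p[k] for w, p in zip(weights, parts)) for k in range(dim)]
    return squash(pooled), weights

# Usage: two part capsules; the one aligned with the query gets more weight.
whole, weights = attention_route([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

Unlike iterative routing-by-agreement, this form computes the coupling weights in a single pass, which is one plausible reason an attention-style router suits a real-time budget.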