CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address real-time, robust gaze estimation under unconstrained conditions, this work proposes a capsule-based architecture with temporally disentangled decoding. It combines a ConvNeXt backbone with capsule formation and attention-based routing to model part-whole relationships explicitly. A dual-GRU decoder captures slow and rapid eye-movement dynamics separately, which is intended to improve the interpretability and generalization of the temporal model. The method reports state-of-the-art angular error on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°), and matches or outperforms existing methods on Gaze360 (9.06°) and RT-GENE (4.76°), with inference under 10 ms and fewer parameters than comparable models. Its core contribution is the combination of capsule representations, attention-based routing, and temporally disentangled GRUs for gaze estimation, targeting efficiency, robustness, and interpretability together.
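The part-whole routing idea can be illustrated with a small, generic sketch. This is not the authors' formulation: it assumes a single-pass scaled dot-product attention over part-capsule votes followed by the standard capsule squash nonlinearity. The names `attention_route`, `votes`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squash(vec, eps=1e-9):
    """Capsule squash: keeps the direction, maps the length into [0, 1)."""
    sq_norm = sum(v * v for v in vec)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * v for v in vec]


def attention_route(votes, query):
    """Pool part-capsule votes into one whole capsule.

    votes: list of N vectors of length D, one vote per part capsule.
    query: vector of length D (hypothetical learned query for the whole).
    Assumption: one attention pass replaces iterative routing-by-agreement.
    """
    dim = len(query)
    # Agreement score of each vote with the query, scaled by sqrt(D).
    scores = [
        sum(v * q for v, q in zip(vote, query)) / math.sqrt(dim)
        for vote in votes
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the votes, one output dimension at a time.
    pooled = [
        sum(w * vote[d] for w, vote in zip(weights, votes))
        for d in range(dim)
    ]
    return squash(pooled), weights
```

Calling `attention_route([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [2.0, 0.0])` gives the two votes aligned with the query equal weight, and more weight than the orthogonal vote. That is the sense in which routing emphasizes parts that agree with the whole.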

📝 Abstract

We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06°) and human-robot interaction scenarios in RT-GENE (4.76°), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found at: https://github.com/toukapy/capsStare
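The benchmark figures above are angular errors in degrees. A minimal sketch of how such an error is commonly computed from pitch/yaw predictions follows. The pitch/yaw-to-vector convention and the function names are assumptions for illustration; they are not taken from the CapStARE repository, and datasets differ in their axis conventions.

```python
import math


def gaze_vector(pitch, yaw):
    """Convert pitch/yaw (radians) to a unit 3D gaze direction.

    Assumed convention (one common choice, not confirmed by the source):
    the camera looks along -z and positive pitch looks down the -y axis.
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_sim = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds, targets):
    """Mean angular error over paired (pitch, yaw) predictions and labels."""
    errors = [
        angular_error_deg(gaze_vector(*p), gaze_vector(*t))
        for p, t in zip(preds, targets)
    ]
    return sum(errors) / len(errors)
```

Under this convention, a prediction that is off by 10° in yaw alone at zero pitch yields an error of 10°, so a reported 3.36 on ETH-XGaze reads as an average deviation of about 3.4° between predicted and true gaze directions.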

Problem

Research questions and friction points this paper is trying to address.

Developing robust real-time gaze estimation for interactive systems

Modeling both slow and rapid gaze dynamics efficiently

Achieving generalization across diverse unconstrained gaze conditions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Capsule-based architecture with attention routing

Dual GRU decoders for slow and rapid dynamics

ConvNeXt backbone feeding capsule formation for efficient part-whole reasoning
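
The dual-decoder idea can be sketched as two independent GRU branches that read the same per-frame feature sequence. This is a minimal illustration under stated assumptions: the summary does not say how the slow and rapid branches differ in their inputs or how their outputs are fused, so this sketch feeds both the same sequence and simply concatenates their final hidden states. `GRUCell` and `dual_gru_decode` are hypothetical names, weights are random, and biases are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(mat, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


class GRUCell:
    """Minimal GRU cell with standard gate equations (biases omitted)."""

    def __init__(self, input_size, hidden_size, rng):
        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_ir, self.w_iz, self.w_in = (init(hidden_size, input_size) for _ in range(3))
        self.w_hr, self.w_hz, self.w_hn = (init(hidden_size, hidden_size) for _ in range(3))

    def step(self, x, h):
        # Reset gate r, update gate z, candidate state n.
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w_ir, x), matvec(self.w_hr, h))]
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w_iz, x), matvec(self.w_hz, h))]
        n = [
            math.tanh(a + r_k * b)
            for a, r_k, b in zip(matvec(self.w_in, x), r, matvec(self.w_hn, h))
        ]
        # New state interpolates between the candidate and the old state.
        return [(1.0 - z_k) * n_k + z_k * h_k for z_k, n_k, h_k in zip(z, n, h)]

    def run(self, sequence):
        """Return the final hidden state after consuming the whole sequence."""
        h = [0.0] * self.hidden_size
        for x in sequence:
            h = self.step(x, h)
        return h


def dual_gru_decode(features, slow_gru, fast_gru):
    """Run two separately parameterized GRUs and concatenate their states.

    Assumption: concatenation stands in for whatever fusion the paper uses.
    """
    return slow_gru.run(features) + fast_gru.run(features)
```

In a trained model, each branch would learn its own dynamics and a small head would map the fused state to pitch and yaw. Here the point is only the structure: two recurrent paths with separate weights over one shared feature sequence.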
