AI Summary
Eye-tracking for AR/VR wearables demands ultra-low latency (<10 ms) and milliwatt-level power consumption, yet conventional frame-based approaches suffer from motion blur, high computational overhead, and insufficient temporal resolution. This paper introduces the first lightweight event-driven spiking neural network (SNN) for eye-tracking: it replaces RNNs and attention modules with leaky integrate-and-fire (LIF) neurons, and integrates neuromorphic event cameras with depthwise separable convolutions to achieve millisecond-scale pupil localization. The model achieves high accuracy (mean error of 3.7–4.1 pixels, approaching that of the specialized Retina system at 3.24 pixels) while reducing model size by 20×, theoretical computation by 850×, and power consumption to only 3.9–4.9 mW. End-to-end latency is as low as 3 ms at 1 kHz. To our knowledge, this is the first work to jointly optimize high performance, ultra-low power, and high temporal fidelity in neuromorphic eye-tracking.
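The stateful unit at the heart of the approach, the leaky integrate-and-fire (LIF) neuron, can be illustrated with a minimal discrete-time sketch. This is not the paper's implementation; the parameter values (leak factor `beta`, firing `threshold`) and the soft-reset-by-subtraction choice are illustrative assumptions:

```python
# Minimal sketch of a leaky integrate-and-fire (LIF) neuron, the kind of
# stateful unit used in place of RNN/attention modules. Parameters are
# illustrative, not taken from the paper.

def lif_step(v, current, beta=0.9, threshold=1.0):
    """One discrete-time LIF update: leak, integrate, fire, reset."""
    v = beta * v + current              # leaky integration of input current
    spike = 1 if v >= threshold else 0  # emit a binary spike at threshold
    if spike:
        v -= threshold                  # soft reset by subtraction
    return v, spike

def run_lif(currents, beta=0.9, threshold=1.0):
    """Drive one neuron with a sequence of input currents; return spikes."""
    v, spikes = 0.0, []
    for i in currents:
        v, s = lif_step(v, i, beta, threshold)
        spikes.append(s)
    return spikes

# A constant sub-threshold input still fires once charge accumulates,
# which is how LIF layers carry temporal state without recurrence weights.
print(run_lif([0.4] * 8))
```

Because state is a single membrane potential per neuron and outputs are binary spikes, downstream layers only compute on the (sparse) spiking events, which is the basis of the power figures quoted above.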
Abstract
Eye tracking for wearable systems demands low latency and milliwatt-level power, but conventional frame-based pipelines struggle with motion blur, high compute cost, and limited temporal resolution. Such capabilities are vital for enabling seamless and responsive interaction in emerging technologies like augmented reality (AR) and virtual reality (VR), where understanding user gaze is key to immersion and interface design. Neuromorphic sensors and spiking neural networks (SNNs) offer a promising alternative, yet existing SNN approaches are either too specialized or fall short of the performance of modern ANN architectures. This paper presents a neuromorphic version of top-performing event-based eye-tracking models, replacing their recurrent and attention modules with lightweight LIF layers and exploiting depthwise separable convolutions to reduce model complexity. Our models obtain 3.7–4.1 px mean error, approaching the accuracy of the application-specific neuromorphic system Retina (3.24 px), while reducing model size by 20× and theoretical compute by 850× compared to the closest ANN variant of the proposed model. These efficient variants are projected to operate at 3.9–4.9 mW with 3 ms latency at 1 kHz. The present results indicate that high-performing event-based eye-tracking architectures can be redesigned as SNNs with substantial efficiency gains, while retaining accuracy suitable for real-time wearable deployment.
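The complexity reduction from depthwise separable convolutions can be checked with back-of-envelope parameter counts. The layer shapes below are illustrative assumptions, not the paper's architecture; the point is the factoring of one k×k full convolution into a per-channel depthwise pass plus a 1×1 pointwise pass:

```python
# Back-of-envelope parameter counts: standard vs. depthwise separable
# convolution. Layer shapes (c_in, c_out, k) are illustrative only.

def standard_conv_params(c_in, c_out, k):
    """A standard conv layer: one k x k filter per (input, output) pair."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise (one k x k filter per input channel) + 1x1 pointwise mix."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = separable_conv_params(c_in, c_out, k)
print(std, sep, round(std / sep, 1))  # separable is several times smaller
```

The per-layer saving approaches a factor of k² for wide layers; the much larger 850× figure quoted in the abstract additionally reflects event-driven sparsity, where synaptic operations are only triggered by spikes rather than computed densely every frame.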