Rapidly deploying on-device eye tracking by distilling visual foundation models

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited gaze estimation accuracy of existing vision foundation models on near-eye infrared images and their poor adaptability to new hardware. The authors propose DistillGaze, a novel framework that, for the first time, applies vision foundation model distillation to on-device eye tracking. By integrating synthetic labeled data with real unlabeled data through a two-stage distillation pipeline—augmented with self-supervised learning and self-training—the method effectively bridges domain gaps and adapts to device variations. Evaluated on a dataset comprising over 2,000 participants, DistillGaze reduces the median gaze error by 58.62% compared to a purely synthetic baseline, while maintaining an ultra-compact model size of only 256K parameters, enabling real-time on-device deployment.
📝 Abstract
Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.
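The abstract describes a student stage that mixes supervised regression on labeled synthetic frames with teacher-guided distillation and self-training (teacher predictions used as pseudo-labels) on unlabeled real frames. The paper does not publish its loss; the snippet below is only a minimal NumPy sketch of one plausible way to combine those terms. The function name `student_loss`, the `alpha` weight, and the use of a plain squared-error distillation term are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def student_loss(student_pred, teacher_pred, synth_labels, is_synthetic,
                 alpha=0.5):
    """Hypothetical combined loss for the on-device student (sketch only).

    student_pred, teacher_pred: (N, 2) predicted gaze angles (yaw, pitch).
    synth_labels: (N, 2) ground-truth gaze; only valid where is_synthetic.
    is_synthetic: (N,) bool mask; real frames fall back to the teacher's
        prediction as a self-training pseudo-label.
    alpha: assumed weight trading off supervision vs. teacher mimicry.
    """
    # Synthetic frames use ground truth; real frames use pseudo-labels.
    targets = np.where(is_synthetic[:, None], synth_labels, teacher_pred)
    supervised = np.mean((student_pred - targets) ** 2)
    # Distillation term: match the teacher on every frame.
    distill = np.mean((student_pred - teacher_pred) ** 2)
    return (1 - alpha) * supervised + alpha * distill

# Tiny worked example: one synthetic frame, one real frame.
student = np.array([[1.0, 0.0], [0.0, 1.0]])
teacher = np.array([[0.0, 0.0], [0.0, 0.0]])
labels = np.array([[1.0, 0.0], [9.0, 9.0]])   # second row unused (real frame)
mask = np.array([True, False])
loss = student_loss(student, teacher, labels, mask)  # → 0.375
```

With `alpha=0.5` the example averages a supervised error of 0.25 and a distillation error of 0.5, giving 0.375; in practice the two terms would be tuned and the targets produced by the stage-one teacher.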
Problem

Research questions and friction points this paper is trying to address.

eye tracking
on-device deployment
visual foundation models
domain gap
gaze estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DistillGaze
visual foundation model distillation
on-device eye tracking
synthetic-to-real domain adaptation
gaze estimation
Cheng Jiang
Postdoc at Institut national de la recherche scientifique (INRS)
Structured illumination, 3D measurement, 3D imaging, Single-pixel imaging
Jogendra Kundu
Meta Reality Labs
David Colmenares
Meta Reality Labs
Fengting Yang
Meta Reality Labs
Joseph Robinson
Meta Reality Labs
Yatong An
Meta Reality Labs
3D Reconstruction, Optical Measurement, Computer Vision, Machine Learning
Ali Behrooz
Meta Reality Labs