🤖 AI Summary
This work addresses calibration-free identity authentication in VR. It proposes a multimodal framework that jointly models eye movements and periocular images within a unified gaze-estimation pipeline, a combination that has not been thoroughly explored at scale. The method fuses the two ocular biometrics, capturing both temporal gaze dynamics and static periocular appearance. The system is evaluated on a large-scale in-house dataset of 9,202 subjects with eye-tracking signal quality equivalent to that of a consumer-facing VR device. The multimodal approach consistently outperforms both unimodal baselines across all evaluated scenarios and surpasses the FIDO benchmark. The authors attribute the gains to a state-of-the-art machine learning architecture and to the complementary discriminative characteristics of the two modalities.
📝 Abstract
This paper investigates the feasibility of fusing two eye-centric authentication modalities, eye movements and periocular images, within a calibration-free authentication system. While each modality has independently shown promise for user authentication, their combination within a unified gaze-estimation pipeline has not been thoroughly explored at scale. We propose a multimodal authentication system and evaluate it on a large-scale in-house dataset comprising 9,202 subjects with an eye tracking (ET) signal quality equivalent to that of a consumer-facing virtual reality (VR) device. Our results show that the multimodal approach consistently outperforms both unimodal systems across all scenarios and surpasses the FIDO benchmark. The integration of a state-of-the-art machine learning architecture contributed significantly to the overall authentication performance at scale, driven by the model's ability to capture authentication representations and the complementary discriminative characteristics of the fused modalities.