🤖 AI Summary
Gaze trajectories produced by event-camera-based eye tracking often suffer from temporal discontinuity and spatial jitter caused by high noise and severe motion artifacts, limiting their usefulness for micro-expression recognition. This paper proposes an inference-time gaze refinement framework that requires no architectural modification or model retraining. The method adds two post-processing components to existing gaze estimators: (1) motion-aware median filtering, which suppresses blink-induced spikes, and (2) optical-flow-driven local refinement, which combines sparse optical flow alignment with temporal trajectory modeling to align predictions with accumulated event motion. A new Jitter Metric jointly quantifies velocity regularity and local signal complexity to measure trajectory smoothness. Experiments show consistent gains across multiple baseline models: average spatial accuracy improves by 18.7% while temporal jitter decreases by 32.4%. The refined gaze signals are more effective and robust for decoding micro-expressions and underlying cognitive states.
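The summary does not give the Jitter Metric's formula, so the sketch below is purely illustrative of the two ingredients it names: velocity regularity and local signal complexity. The function name `jitter_metric`, the coefficient-of-variation term, and the second-difference complexity term are our assumptions, not the authors' definition.

```python
import numpy as np

def jitter_metric(gaze, dt=1.0):
    """Hypothetical jitter score combining velocity regularity and
    local signal complexity (the paper's exact formula is not given).

    gaze : (T, 2) array of predicted (x, y) gaze points.
    dt   : sampling interval between consecutive predictions.
    """
    gaze = np.asarray(gaze, dtype=float)
    vel = np.diff(gaze, axis=0) / dt           # per-step velocity vectors
    speed = np.linalg.norm(vel, axis=1)        # scalar speeds

    # Velocity regularity: coefficient of variation of speed.
    # A smooth trajectory has nearly constant speed, giving a low value.
    regularity = speed.std() / (speed.mean() + 1e-8)

    # Local signal complexity: mean magnitude of the second difference
    # (discrete acceleration), normalised by mean speed.
    accel = np.diff(vel, axis=0) / dt
    complexity = np.linalg.norm(accel, axis=1).mean() / (speed.mean() + 1e-8)

    # One simple way to combine the two terms into a joint score.
    return regularity * complexity             # lower = smoother trajectory
```

Lower scores indicate smoother trajectories: a constant-velocity sweep scores near zero, while blink-like spikes inflate both terms.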
📝 Abstract
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts: features critical for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework that enhances the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories via velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments.
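To make the two modules concrete, here is a minimal, hypothetical sketch of how such a post-processing pipeline could look with NumPy, SciPy, and OpenCV. The spike threshold, window sizes, 50/50 blending weight, and the use of pyramidal Lucas-Kanade flow (`cv2.calcOpticalFlowPyrLK`) on accumulated event frames are all our assumptions; the paper's actual filtering and refinement rules may differ.

```python
import cv2
import numpy as np
from scipy.signal import medfilt

def refine_gaze(gaze, frames, spike_thresh=3.0, kernel=5, win=15):
    """Illustrative pipeline in the spirit of the paper's two modules.
    Parameter names and thresholds are our own, not the authors'.

    gaze   : (T, 2) float array of raw gaze predictions.
    frames : list of T uint8 grayscale frames (e.g., accumulated events).
    """
    gaze = np.asarray(gaze, dtype=np.float32).copy()

    # (i) Motion-aware median filtering: flag samples whose step size is an
    # outlier (likely blink-induced spikes) and replace only those with a
    # local median, leaving natural gaze dynamics untouched.
    step = np.linalg.norm(np.diff(gaze, axis=0), axis=1)
    spikes = np.where(step > spike_thresh * (np.median(step) + 1e-6))[0] + 1
    smoothed = np.stack([medfilt(gaze[:, d], kernel) for d in range(2)], axis=1)
    gaze[spikes] = smoothed[spikes]

    # (ii) Optical-flow-based local refinement: track each gaze point into
    # the next frame with pyramidal Lucas-Kanade flow and blend the tracked
    # position with the model's prediction.
    for t in range(len(frames) - 1):
        p0 = gaze[t].reshape(1, 1, 2)
        p1, status, _ = cv2.calcOpticalFlowPyrLK(
            frames[t], frames[t + 1], p0, None, winSize=(win, win))
        if status[0, 0] == 1:  # tracking succeeded
            gaze[t + 1] = 0.5 * gaze[t + 1] + 0.5 * p1.reshape(2)

    return gaze
```

Because both steps operate purely on model outputs and raw event frames, a pipeline of this shape stays model-agnostic: it can wrap any gaze estimator without touching its weights, consistent with the framework's stated design goal.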