🤖 AI Summary
This work addresses face recognition with event cameras, whose streams lack the stable photometric appearance that conventional RGB-based methods rely on. The authors argue that identity should instead be modeled through structure-driven spatiotemporal representations shaped by rigid facial motion and individual facial geometry. Because dedicated datasets are lacking, they construct EFace, a small-scale event-based face dataset captured under rigid facial motion. Their framework, EventFace, uses Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, then adds a Motion Prompt Encoder (MPE) to encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features. On EFace, the method achieves a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%, the best performance among the evaluated baselines, and it shows stronger robustness under degraded illumination and reduced template reconstructability.
📝 Abstract
Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
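For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute a Rank-1 identification rate and an approximate equal error rate from similarity scores. It is not the authors' evaluation code: the function names `rank1_rate` and `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for illustration, and the paper's exact protocol (gallery/probe split, score type) may differ.

```python
import numpy as np


def rank1_rate(sim, probe_ids, gallery_ids):
    """Fraction of probes whose most similar gallery template shares their identity.

    sim[i, j] is the similarity between probe i and gallery template j
    (higher means more similar).
    """
    best = np.argmax(sim, axis=1)  # index of the top-ranked gallery template per probe
    return float(np.mean(gallery_ids[best] == probe_ids))


def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping every observed score as a decision threshold.

    A pair is accepted when its score is >= the threshold. The EER is taken
    at the threshold where the false accept rate (FAR) and the false reject
    rate (FRR) are closest, and reported as their average.
    """
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor >= t)  # impostor pairs wrongly accepted
        frr = np.mean(genuine < t)    # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


# Toy data (hypothetical): 3 probes scored against a 3-template gallery.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.7, 0.4]])
ids = np.array([0, 1, 2])
print("Rank-1:", rank1_rate(sim, ids, ids))

# Toy verification scores (hypothetical) for same-identity and different-identity pairs.
genuine = np.array([0.9, 0.8, 0.7, 0.2])
impostor = np.array([0.1, 0.3, 0.4, 0.85])
print("EER:", equal_error_rate(genuine, impostor))
```

A higher Rank-1 rate and a lower EER both indicate better recognition. The threshold sweep is a coarse approximation; evaluations often interpolate the ROC curve instead.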