Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Face landmark alignment with event cameras remains challenging under low-light and fast-motion conditions—existing RGB-based methods generalize poorly, while purely event-based approaches suffer from spatial sparsity and lack of annotated data. To address this, we propose an RGB-guided self-supervised event representation learning framework. Our key contributions are: (1) a cross-modal fusion attention (CMFA) mechanism that leverages RGB priors to guide event feature extraction; (2) self-supervised multi-event representation learning (SSMER), which mines spatiotemporal structure from unlabeled event streams to mitigate sparsity and annotation bottlenecks; and (3) an end-to-end event encoding and alignment network. Evaluated on our newly constructed real-world E-SIE dataset and the synthetic WFLW-V benchmark, our method achieves state-of-the-art performance, attaining superior accuracy, robustness, and generalization—as evidenced by leading results on NME and PCK metrics.
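The paper does not spell out the internals of its cross-modal fusion attention, but the described behavior (RGB priors guiding event feature extraction) matches standard cross-attention with queries from the event branch and keys/values from the RGB branch. A minimal sketch under that assumption, with all function and parameter names my own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion_attention(event_feats, rgb_feats, W_q, W_k, W_v):
    """Hypothetical CMFA sketch: event tokens attend to RGB tokens,
    so the (spatially denser) RGB modality steers which sparse event
    features are emphasized.

    event_feats: (N_e, d_in) event-branch tokens
    rgb_feats:   (N_r, d_in) RGB-branch tokens
    W_q, W_k, W_v: (d_in, d) learned projections (random here)
    """
    Q = event_feats @ W_q                      # queries from events
    K = rgb_feats @ W_k                        # keys from RGB
    V = rgb_feats @ W_v                        # values from RGB
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N_e, N_r)
    fused = attn @ V                           # RGB-guided event features
    return fused, attn
```

Each fused event token is a convex combination of RGB value vectors, which is one plausible way RGB priors can compensate for event sparsity; the paper's actual CMFA block may differ.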

📝 Abstract
Event cameras offer unique advantages for facial keypoint alignment under challenging conditions, such as low light and rapid motion, due to their high temporal resolution and robustness to varying illumination. However, existing RGB facial keypoint alignment methods do not perform well on event data, and training solely on event data often leads to suboptimal performance because of its limited spatial information. Moreover, the lack of comprehensive labeled event datasets further hinders progress in this area. To address these issues, we propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment. Our framework employs CMFA to integrate corresponding RGB data, guiding the model to extract robust facial features from event input images. In parallel, SSMER enables effective feature learning from unlabeled event data, overcoming spatial limitations. Extensive experiments on our real-event E-SIE dataset and a synthetic-event version of the public WFLW-V benchmark show that our approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.
Problem

Research questions and friction points this paper is trying to address.

Aligning facial keypoints using event camera data
Overcoming limited spatial information in event data
Addressing lack of labeled event datasets for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal fusion attention integrates RGB data
Self-supervised learning overcomes spatial limitations
Multi-event representation learning extracts robust features
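The paper's exact event encoding is not described here, but multi-event representations are commonly built by binning the sparse (x, y, timestamp, polarity) stream into a dense spatiotemporal voxel grid. A minimal sketch of that standard conversion, as an assumption rather than the authors' method:

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, height, width, num_bins):
    """Accumulate an event stream into a (num_bins, H, W) voxel grid.

    xs, ys: pixel coordinates; ts: timestamps; ps: polarities (+1/-1).
    Each event adds its polarity to the temporal bin its timestamp
    falls into, giving a dense tensor a CNN encoder can consume.
    """
    xs = np.asarray(xs, dtype=int)
    ys = np.asarray(ys, dtype=int)
    ts = np.asarray(ts, dtype=float)
    ps = np.asarray(ps, dtype=float)

    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t0, t1 = ts.min(), ts.max()
    # Normalize timestamps to [0, num_bins) and floor to bin indices
    bins = ((ts - t0) / max(t1 - t0, 1e-9) * (num_bins - 1e-6)).astype(int)
    np.add.at(grid, (bins, ys, xs), ps)  # scatter-add polarities
    return grid
```

Self-supervised objectives like SSMER can then be defined over several such grids built from different temporal windows of the same unlabeled stream.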
Authors

Donghwa Kang (KAIST)
Junho Kim (School of Electronic and Electrical Engineering, Hongik University, Seoul 04066, South Korea)
Dongwoo Kang (Hongik University)