🤖 AI Summary
Facial expression spotting, and micro-expression spotting in particular, faces two major challenges: interference from non-expressive facial movements and the difficulty of modeling subtle, transient dynamics. To address these, we propose a multi-scale spatiotemporal modeling framework tailored for video-level temporal localization. First, we introduce Sliding-Window Multi-Resolution Optical Flow (SW-MRO), a novel motion-sensitive feature extraction method that enhances the discriminability of fine-grained facial dynamics. Second, we design SpotFormer, a multi-scale spatiotemporal Transformer that integrates Facial Local Graph Pooling (FLGP) with convolutional layers to jointly capture local and global spatiotemporal dependencies. Third, we pioneer the incorporation of supervised contrastive learning into the spotting task to strengthen frame-level probabilistic discrimination. Our method achieves state-of-the-art performance on SAMM-LV and CAS(ME)², significantly improving the micro-expression spotting F1-score while demonstrating strong robustness against head motion and non-expressive facial actions.
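The SW-MRO idea, optical flow computed over short sliding windows at several spatial resolutions, can be sketched in a few lines of NumPy. Everything below is illustrative, not the paper's implementation: the frame-pairing scheme (first frame of each window against every later frame) is an assumption, and `estimate_flow` is a crude global Lucas-Kanade fit standing in for a real optical flow estimator such as Farneback's method.

```python
import numpy as np

def estimate_flow(f0, f1):
    """Crude global flow estimate: least-squares fit of a single (u, v)
    to the brightness-constancy equation Ix*u + Iy*v + It = 0.
    A stand-in for a proper dense optical flow method."""
    Iy, Ix = np.gradient(f0)
    It = f1 - f0
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    uv, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return uv  # (u, v)

def downsample(frame, factor):
    """Block-average downsampling to a coarser resolution."""
    h, w = frame.shape
    h, w = h - h % factor, w - w % factor
    return frame[:h, :w].reshape(h // factor, factor,
                                 w // factor, factor).mean(axis=(1, 3))

def sw_mro(frames, window=5, factors=(1, 2, 4)):
    """Sliding-window multi-resolution flow features (illustrative).

    For each window start, estimate flow between the window's first frame
    and every later frame, at several spatial resolutions, and concatenate
    into one feature vector per window."""
    feats = []
    for s in range(len(frames) - window + 1):
        per_window = []
        for t in range(1, window):
            for f in factors:
                per_window.append(estimate_flow(downsample(frames[s], f),
                                                downsample(frames[s + t], f)))
        feats.append(np.concatenate(per_window))
    return np.stack(feats)  # (num_windows, (window - 1) * len(factors) * 2)

# Tiny demo: a textured frame drifting rightward by 1 px/frame.
rng = np.random.default_rng(0)
base = rng.random((32, 32))
frames = [np.roll(base, shift=t, axis=1) for t in range(8)]
features = sw_mro(frames, window=5)
print(features.shape)  # (4, 24)
```

The compact window keeps each flow estimate short-range, which is what lets the feature pick up brief micro-expression motion without accumulating long-range head-movement drift.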
📝 Abstract
Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the difficulty of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical Flow (SW-MRO) feature, which computes multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and to distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding interference from severe head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes the spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We validate the architecture of SpotFormer through comparisons with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)² show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
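The supervised contrastive objective mentioned in the abstract can be sketched as the generic SupCon loss (Khosla et al., 2020) over frame embeddings: frames with the same expression label are pulled together, all others pushed apart. How SpotFormer actually constructs its batches and positive sets is an assumption here; this is a minimal NumPy version of the standard loss.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive (SupCon) loss over a batch of
    embeddings: same-label pairs are positives, all others negatives."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    off_diag = ~np.eye(n, dtype=bool)
    sim = np.where(off_diag, sim, -np.inf)         # exclude self-similarity
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & off_diag
    counts = pos.sum(axis=1)
    valid = counts > 0                             # anchors with >= 1 positive
    mean_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / counts[valid]
    return -mean_log_prob.mean()

# Tight per-class clusters (e.g. embeddings of two expression types) score
# much lower than the same embeddings with mismatched labels.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_clustered = supcon_loss(feats, np.array([0, 0, 1, 1]))
loss_mixed = supcon_loss(feats, np.array([0, 1, 0, 1]))
print(loss_clustered < loss_mixed)  # True
```

Minimizing this loss sharpens the separation between expression classes in embedding space, which is what the abstract credits for the improved frame-level probability estimates.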