🤖 AI Summary
Micro-expression recognition (MER) faces two challenges: dynamic spatiotemporal information is difficult to model, and labeled training data is severely scarce. To address these, we propose a fine-grained dynamic perception (FDP) framework with three key components: (1) a local-global feature-aware Transformer for frame-level representation learning; (2) a rank scorer that orders frame-level features chronologically, so that the ranking encodes the dynamics of both micro-expression appearance and motion; and (3) a joint dynamic image construction task that increases sensitivity to subtle facial movements and alleviates data scarcity. The rank features are pooled along the temporal dimension into a dynamic representation shared by the recognition and construction modules. On the CASME II, SAMM, CAS(ME)², and CAS(ME)³ benchmark datasets, our method improves F1-score over the previous best results by 4.05%, 2.50%, 7.71%, and 2.11%, respectively.
📝 Abstract
Facial micro-expression recognition (MER) is a challenging task due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods rely on either hand-crafted features or deep networks: the former often additionally require key frames, while the latter suffer from small-scale, low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank the frame-level features of a sequence of raw frames in chronological order, so that the ranking process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is then adopted to calculate a rank score for each frame-level feature. Afterwards, the rank features from the rank scorer are pooled along the temporal dimension to capture a dynamic representation. Finally, the dynamic representation is shared by an MER module and a dynamic image construction module: the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The dynamic image construction task helps capture the subtle facial actions associated with MEs and alleviates the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms state-of-the-art MER methods, and (ii) works well for dynamic image construction. In particular, FDP improves on the previous best F1-scores by 4.05%, 2.50%, 7.71%, and 2.11% on the CASME II, SAMM, CAS(ME)², and CAS(ME)³ datasets, respectively. The code is available at https://github.com/CYF-cuber/FDP.
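To make the idea of collapsing a chronologically ordered sequence into a single dynamic representation more concrete, the sketch below applies approximate rank pooling, a common way to build dynamic images in the video recognition literature. This is an assumption for illustration only: the abstract does not state how FDP pools its rank features or computes the dynamic image, and the simplified coefficients `alpha_t = 2t - T - 1`, as well as the function names `rank_pooling_weights` and `temporal_rank_pool`, are hypothetical and not taken from the FDP repository.

```python
from typing import List, Sequence


def rank_pooling_weights(num_frames: int) -> List[float]:
    """Simplified approximate-rank-pooling coefficients (assumed, not from FDP).

    Frame t (1-indexed) out of T frames gets weight 2t - T - 1, so early
    frames receive negative weights and late frames positive weights.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    return [float(2 * t - num_frames - 1) for t in range(1, num_frames + 1)]


def temporal_rank_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse T per-frame vectors into one vector by a weighted sum.

    Each element of `frames` is a flattened frame or a frame-level feature
    vector; all vectors must have the same length.
    """
    weights = rank_pooling_weights(len(frames))
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same length")

    pooled = [0.0] * dim
    for weight, frame in zip(weights, frames):
        for i, value in enumerate(frame):
            pooled[i] += weight * value
    return pooled


if __name__ == "__main__":
    # Toy sequence: the first component changes over time, the second is static.
    sequence = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    print(temporal_rank_pool(sequence))
```

Because the weights sum to zero, components that stay constant across the sequence cancel out, while components that change over time survive in the pooled vector. That property is one plausible reason a dynamic image target could push a model toward the subtle facial motions that distinguish MEs.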