🤖 AI Summary
This study addresses covert emotion recognition by precisely detecting onset, offset, and apex frames of both macro- and micro-expressions in videos and modeling their underlying affective timelines. To mitigate spurious correlations between facial Action Units (AUs) and emotion categories—arising from dataset bias—we propose a causally grounded, unbiased AU modeling framework. Instead of conventional AU adjacency modeling, our approach employs causal inference to explicitly disentangle non-causal AU–emotion associations, retaining only causally relevant AUs for classification. The method integrates fast causal discovery, a causal graph neural network, and AU temporal modeling to achieve end-to-end micro-expression spotting. Evaluated on CAS(ME)² and SAMM-Long Video datasets, it achieves F1-scores of 0.388 and 0.3701, respectively—substantially outperforming state-of-the-art methods.
📝 Abstract
Detecting concealed emotions within apparently normal expressions is crucial for identifying potential mental health issues and facilitating timely support and intervention. Spotting macro- and micro-expressions involves predicting the emotional timeline within a video by identifying the onset, apex, and offset frames of the displayed emotions. Utilizing foundational facial muscle movement cues, known as facial action units (AUs), boosts accuracy. However, a challenge overlooked by previous research is the inadvertent incorporation of biases into the trained model: biases arising from datasets can spuriously link certain action-unit movements to particular emotion classes. We tackle this issue by replacing action-unit adjacency information with action-unit causal graphs, which identify and eliminate undesired spurious connections so that only unbiased information is retained for classification. Our model, named Causal-Ex (Causal-based Expression spotting), employs a rapid causal inference algorithm to construct a causal graph of facial action units, enabling the selection of causally relevant ones. Our work demonstrates improved overall F1-scores compared with state-of-the-art approaches: 0.388 on CAS(ME)² and 0.3701 on SAMM-Long Video datasets.
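The core idea, disentangling spurious AU–emotion associations induced by a dataset confounder from genuinely causal ones, can be illustrated with a toy partial-correlation sketch. This is not the paper's actual algorithm (Causal-Ex uses a fast causal discovery method over AU graphs); all function names, thresholds, and the synthetic data below are illustrative assumptions only.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing out z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

def causally_relevant_aus(au_matrix, emotion, confounder, thresh=0.2):
    """Keep AU columns whose association with the emotion signal
    survives conditioning on the confounder (e.g. dataset identity)."""
    return [j for j in range(au_matrix.shape[1])
            if abs(partial_corr(au_matrix[:, j], emotion, confounder)) > thresh]

# Synthetic illustration: AU 0 genuinely drives the emotion signal,
# while AU 1 co-occurs with it only through a dataset-level confounder.
rng = np.random.default_rng(0)
n = 2000
confounder = rng.normal(size=n)               # e.g. dataset/subject bias
au0 = rng.normal(size=n)                      # causally relevant AU
au1 = confounder + 0.3 * rng.normal(size=n)   # spuriously correlated AU
emotion = au0 + confounder + 0.3 * rng.normal(size=n)
aus = np.column_stack([au0, au1])

kept = causally_relevant_aus(aus, emotion, confounder)
print(kept)  # AU 1's link vanishes once the confounder is controlled for
```

Here AU 1 is strongly correlated with the emotion signal in the raw data, yet it is dropped once the confounder is conditioned on; only the causally relevant AU 0 survives, mirroring the unbiased selection the paper describes.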