🤖 AI Summary
This work addresses a challenge in micro-expression recognition: identical facial action units (AUs) can correspond to different emotions, so the resulting expressions look highly similar, while existing approaches rely on explicit motion cues and neglect implicit emotional cues. To tackle this issue, the authors propose a dual-branch network: a motion branch guided by AU detection extracts explicit dynamic features, and an emotion branch built on a Sparse Emotion Vision Transformer (SEViT) captures local temporal emotional cues at multiple sparsity rates. An orthogonality loss explicitly disentangles the two feature types, and a Collaborative Fusion Module (CoFM) adaptively integrates them. Experiments on three benchmark datasets indicate that the method decouples motion and emotion features and achieves superior recognition performance, improving both recognition accuracy and generalization.
📝 Abstract
Unlike macro-expressions, micro-expressions do not follow a strictly consistent mapping between emotions and Action Units (AUs). As a result, some micro-expressions share identical AUs yet represent opposite emotional categories, which makes them highly similar visually. Existing micro-expression recognition (MER) methods mostly rely on explicit facial motion cues (e.g., optical flow, frame differences, AU features) and ignore implicit emotion information. To tackle this issue, this paper presents a Motion Emotion Feature Decoupling Network (MEDN) for MER. We design a dual-branch framework that extracts motion and emotion features separately. In the motion branch, an AU-detection task restricts features to the explicit motion domain, and an orthogonal loss is adopted to reduce the coupling between motion and emotion features. For implicit emotion modeling, we propose a Sparse Emotion Vision Transformer (SEViT) that sparsifies spatial tokens at multi-scale sparsity rates to highlight local temporal variations. A Collaborative Fusion Module (CoFM) is further developed to adaptively fuse the disentangled motion and emotion features. Extensive experiments on three benchmark datasets validate that MEDN effectively decouples motion and emotion features and achieves superior recognition performance, offering a new perspective for enhancing recognition accuracy and generalization.
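The abstract does not give the exact form of the orthogonal loss, so the following is only a minimal sketch of one common formulation: the mean squared cosine similarity between paired motion and emotion feature vectors, which is zero when each pair is orthogonal and approaches one when the pair is collinear. The function name `orthogonality_loss`, its arguments, and the choice of squared cosine similarity are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def orthogonality_loss(
    motion: Sequence[Sequence[float]],
    emotion: Sequence[Sequence[float]],
    eps: float = 1e-8,
) -> float:
    """Mean squared cosine similarity over a batch of paired feature vectors.

    Hypothetical formulation: the paper's actual loss may differ.
    `motion[i]` and `emotion[i]` are the two branch outputs for sample i.
    """
    if not motion or len(motion) != len(emotion):
        raise ValueError("motion and emotion must be non-empty and the same length")

    total = 0.0
    for m, e in zip(motion, emotion):
        if len(m) != len(e):
            raise ValueError("paired feature vectors must share a dimension")
        dot = sum(a * b for a, b in zip(m, e))
        norm_m = math.sqrt(sum(a * a for a in m))
        norm_e = math.sqrt(sum(b * b for b in e))
        # eps guards against division by zero for all-zero feature vectors.
        cos = dot / (norm_m * norm_e + eps)
        total += cos * cos
    return total / len(motion)


if __name__ == "__main__":
    # Orthogonal pair: no penalty.
    print(orthogonality_loss([[1.0, 0.0]], [[0.0, 1.0]]))
    # Collinear pair: penalty close to 1.
    print(orthogonality_loss([[1.0, 2.0]], [[2.0, 4.0]]))
```

Squaring the cosine penalizes both positive and negative correlation, so minimizing this term alongside the recognition loss pushes the two branches toward encoding complementary rather than redundant information.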