🤖 AI Summary
This work addresses the limited accuracy and robustness of facial expression recognition in complex scenarios by proposing a residual masking network that builds an explicit attention mechanism into convolutional neural networks through segmentation masks. The approach combines a deep residual network with a U-Net-like structure that generates refined feature masks for expression-relevant facial regions, guiding the model to focus on discriminative areas and improving feature learning. Experiments on the FER2013 benchmark and the private VEMO dataset show that the proposed method achieves state-of-the-art recognition accuracy.
📝 Abstract
Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improving FER, this paper focuses on deep architectures with an attention mechanism. We propose a novel Masking Idea to boost the performance of CNNs on the facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information when making decisions. In experiments, we combine the ubiquitous Deep Residual Network with a Unet-like architecture to produce a Residual Masking Network. The proposed method achieves state-of-the-art (SOTA) accuracy on the well-known FER2013 dataset and the private VEMO dataset.
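
The masking step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact combination rule, so the form `features * (1 + mask)` is an assumption chosen to keep the original features and amplify the regions the mask scores highly. The function `toy_mask_branch` is a hypothetical stand-in for the Unet-like segmentation branch (no learned weights), and all names and shapes are illustrative.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def toy_mask_branch(features):
    """Hypothetical stand-in for a Unet-like mask branch (no learned weights).

    Encoder step: 2x2 average pooling halves the spatial resolution.
    Decoder step: nearest-neighbour repetition restores it.
    A sigmoid then turns the result into a soft mask in (0, 1).

    features: array of shape (channels, height, width), height and width even.
    """
    c, h, w = features.shape
    pooled = features.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    upsampled = pooled.repeat(2, axis=1).repeat(2, axis=2)
    return sigmoid(upsampled)


def residual_masking(features):
    """Refine a feature map with a soft mask (assumed rule, see lead-in).

    The "+ 1" acts as an identity path: a mask near 0 leaves the features
    almost unchanged, a mask near 1 roughly doubles them.
    """
    mask = toy_mask_branch(features)
    return features * (1.0 + mask)


# Illustrative input: 4 channels on an 8x8 spatial grid.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 8, 8))
refined = residual_masking(feature_map)
```

In a trained network the mask branch would contain learned convolutions and be placed after residual layers; the sketch only shows how a mask with the same shape as the feature map can re-weight it without discarding information.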