🤖 AI Summary
Existing deep facial expression recognition methods rely on discriminative classifiers, which are prone to learning shortcut features and exhibit limited robustness under distribution shifts. To address this, this work proposes the Emotion Diffusion Classifier (EmoDC), a conditional generative diffusion-based framework augmented with an Adaptive Margin Discrepancy Training (AMDiT) strategy. AMDiT dynamically adjusts sample-level margins to enhance the model's ability to distinguish between correct and incorrect class-conditional predictions. By jointly optimizing noise-prediction error and margin discrepancy, EmoDC achieves substantial improvements in classification accuracy across multiple benchmarks, including RAF-DB, SFEW-2.0, and AffectNet, and demonstrates superior robustness compared to existing discriminative approaches under perturbations such as noise and image blur.
📝 Abstract
Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, training EmoDC with the standard denoising strategy fails to penalize predictions conditioned on incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between the noise-prediction errors for correct and incorrect categories, thereby enhancing the model's discriminative capability. Nevertheless, a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample. Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the base model trained with standard denoising diffusion training on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in terms of robustness against noise and blur.
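The margin-discrepancy objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the per-sample margin rule (shrinking the required gap for images whose correct-condition denoising error is already large), the hinge form, and the hyperparameters `base_margin` and `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def amdit_loss(err_correct, err_wrong, base_margin=0.1, alpha=0.5):
    """Sketch of an adaptive-margin discrepancy loss for a diffusion classifier.

    err_correct: per-sample noise-prediction error when the model is
                 conditioned on the correct class label, shape [B].
    err_wrong:   per-sample error when conditioned on a mismatched class,
                 shape [B].
    """
    # Adaptive per-sample margin (assumed rule): images that are harder to
    # denoise (larger err_correct) are required to satisfy a smaller gap.
    margin = base_margin / (1.0 + alpha * err_correct)
    # Hinge penalty: active whenever the wrong-class error fails to exceed
    # the correct-class error by at least the margin.
    hinge = np.maximum(0.0, margin + err_correct - err_wrong)
    # Joint objective: standard denoising error on the correct condition
    # plus the margin-discrepancy penalty, averaged over the batch.
    return float(np.mean(err_correct + hinge))
```

For a sample where the wrong-class error already exceeds the correct-class error by more than the margin, the hinge term vanishes and only the ordinary denoising loss remains; the penalty activates only on samples where the two class-conditional errors are too close to separate reliably.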