🤖 AI Summary
To address the scarcity of high-quality labeled data and the poor generalization of pretrained models in image emotion recognition (IER), this paper proposes a cross-domain IER method tailored for generic images, including non-facial and non-human content. Leveraging facial expression recognition (FER) models, it transfers knowledge via feature-level domain adaptation. It also introduces DnCShap, a novel explainability method that integrates a divide-and-conquer strategy with SHAP to improve discriminability and attribution credibility. The method optimizes a discrepancy loss for domain alignment, analyzes embedding-space projections, and visualizes salient regions via heatmaps. Evaluated on four benchmark datasets (IAPSa, ArtPhoto, FI, and EMOTIC) the approach achieves accuracies of 61.86%, 62.47%, 70.78%, and 59.72%, respectively, outperforming existing unsupervised and weakly supervised IER methods.
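The discrepancy loss mentioned above aligns source (FER) and target (IER) feature distributions. The paper does not spell out the exact loss here, but a standard choice for such feature-level domain adaptation is the squared Maximum Mean Discrepancy (MMD) with an RBF kernel; the sketch below (function name and `gamma` parameter are illustrative, not from the paper) shows how it can be computed over batches of features:

```python
import numpy as np

def mmd_discrepancy(source_feats, target_feats, gamma=1.0):
    """Squared Maximum Mean Discrepancy between two feature batches.

    A common feature-level discrepancy loss for domain adaptation;
    the paper's exact formulation may differ. Each input is an
    (n_samples, n_features) array.
    """
    def rbf(a, b):
        # Pairwise squared Euclidean distances between rows of a and b,
        # mapped through an RBF kernel.
        d2 = (np.sum(a**2, axis=1)[:, None]
              + np.sum(b**2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-gamma * d2)

    k_ss = rbf(source_feats, source_feats).mean()  # source-source similarity
    k_tt = rbf(target_feats, target_feats).mean()  # target-target similarity
    k_st = rbf(source_feats, target_feats).mean()  # cross-domain similarity
    return k_ss + k_tt - 2.0 * k_st
```

Minimizing this quantity during adaptation pushes the two feature distributions together: it is exactly zero when the batches are identical and grows as they diverge.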
📝 Abstract
This paper proposes a feature-based domain adaptation technique for identifying emotions in generic images, encompassing both facial and non-facial objects, as well as non-human components. This approach addresses the challenge of the limited availability of pre-trained models and well-annotated datasets for Image Emotion Recognition (IER). Initially, a deep-learning-based Facial Expression Recognition (FER) system is developed, classifying facial images into discrete emotion classes. Maintaining the same network architecture, this FER system is then adapted to recognize emotions in generic images through the application of a discrepancy loss, enabling the model to effectively learn IER features while classifying emotions into categories such as 'happy,' 'sad,' 'hate,' and 'anger.' Additionally, a novel interpretability method, Divide and Conquer based SHAP (DnCShap), is introduced to elucidate the visual features most relevant for emotion recognition. The proposed IER system demonstrates emotion classification accuracies of 61.86% for the IAPSa dataset, 62.47% for the ArtPhoto dataset, 70.78% for the FI dataset, and 59.72% for the EMOTIC dataset. The system identifies the important visual features that lead to specific emotion classifications and also provides detailed embedding plots explaining the predictions, enhancing understanding of and trust in AI-driven emotion recognition systems.
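The abstract does not detail how DnCShap combines divide-and-conquer with SHAP-style attribution. As a rough illustration only (not the paper's algorithm), the sketch below recursively halves a flattened feature vector and assigns each region the change in model output when that region is restored from a baseline, in the spirit of masking-based Shapley approximations; all names here are hypothetical:

```python
import numpy as np

def dnc_attribution(x, baseline, predict):
    """Divide-and-conquer attribution sketch (illustrative, not DnCShap itself).

    Recursively splits the feature index range in half; at the leaves,
    each feature's attribution is the change in model output when that
    feature alone is restored from the baseline.
    """
    attr = np.zeros(x.size)

    def recurse(lo, hi):
        if hi - lo <= 1:
            # Leaf: restore just this feature and measure the output shift.
            masked = baseline.copy()
            masked[lo:hi] = x[lo:hi]
            attr[lo:hi] = predict(masked) - predict(baseline)
            return
        mid = (lo + hi) // 2
        recurse(lo, mid)   # conquer the left half
        recurse(mid, hi)   # conquer the right half

    recurse(0, x.size)
    return attr
```

For a linear model this recovers the exact per-feature contributions `w * (x - baseline)`; for a deep IER model the per-region scores would instead be aggregated into the heatmaps the paper uses to highlight emotion-relevant regions.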