SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation in semi-supervised deep facial expression recognition (SS-DFER) caused by unreliable pseudo-labels, this paper proposes a multimodal collaborative pseudo-labeling framework that integrates semantic-, instance-, and text-level information. It introduces, for the first time, a three-tier pseudo-labeling mechanism (semantic, instance, and text) that incorporates fine-grained textual description embeddings to enable cross-modal semantic alignment. The approach jointly models visual–textual and visual–instance similarities, aggregates the resulting probabilities with weights, and applies text-guided joint supervision, with contrastive learning used to enhance instance representations. Extensive experiments on three benchmark datasets demonstrate substantial improvements over existing semi-supervised methods; notably, under certain settings, the method even surpasses fully supervised baselines. These results support the effectiveness and generalizability of cross-modal collaborative modeling for SS-DFER.
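The summary mentions contrastive learning for instance representations but does not say which loss is used. As a rough illustration only, the sketch below uses InfoNCE, a common contrastive objective. This choice is an assumption, not the paper's stated method. The function name `info_nce`, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (assumed loss; temperature is hypothetical).

    The loss is the negative log-probability of picking the positive
    among the positive and all negatives, scored by scaled cosine similarity.
    """
    pos_score = cosine(anchor, positive) / temperature
    scores = [pos_score] + [cosine(anchor, n) / temperature for n in negatives]
    # Stable log-sum-exp over all candidate scores.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

# Toy usage: two views of one face as anchor/positive, other faces as negatives.
loss = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
```

Minimizing this loss pulls the anchor toward its positive and pushes it away from the negatives, which is the general effect the summary attributes to the contrastive term.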

📝 Abstract
Semi-supervised deep facial expression recognition (SS-DFER) has gained increasing research interest due to the difficulty of obtaining sufficient labeled data in practical settings. However, existing SS-DFER methods mainly rely on generated semantic-level pseudo-labels for supervised learning, and the unreliability of these pseudo-labels compromises performance and undermines practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic-, instance-, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, to exploit the comprehensive knowledge within textual descriptions and instance representations, we calculate the similarities between the facial visual features and the corresponding textual and instance features to obtain text-level and instance-level probabilities, respectively. These are combined with the semantic-level probability, and the three levels of probabilities are carefully aggregated to obtain the final pseudo-labels. Furthermore, to make better use of the one-hot labels of the labeled data, we also incorporate text embeddings extracted from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at https://github.com/PatrickStarL/SIT-FER.
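The three-level pseudo-labeling described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the semantic-level probability comes from a softmax over classifier logits, and that the text- and instance-level probabilities come from a temperature-scaled softmax over cosine similarities. It also assumes the aggregation is a fixed weighted sum followed by a confidence threshold. The names `aggregate_pseudo_label`, `weights`, `temperature`, and `threshold` are hypothetical, as is the use of one prototype vector per class for the instance level.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def similarity_probs(feature, class_embeddings, temperature=0.1):
    """Class probabilities from similarity to one embedding per class."""
    sims = [cosine(feature, emb) for emb in class_embeddings]
    return softmax(sims, temperature)

def aggregate_pseudo_label(logits, feature, text_embs, instance_protos,
                           weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.7):
    """Fuse semantic-, text-, and instance-level probabilities.

    Assumed scheme: fixed weights that sum to 1, then a confidence
    threshold that decides whether the pseudo-label is used at all.
    """
    p_sem = softmax(logits)                             # semantic level
    p_txt = similarity_probs(feature, text_embs)        # text level
    p_ins = similarity_probs(feature, instance_protos)  # instance level
    w_sem, w_txt, w_ins = weights
    fused = [w_sem * a + w_txt * b + w_ins * c
             for a, b, c in zip(p_sem, p_txt, p_ins)]
    label = max(range(len(fused)), key=fused.__getitem__)
    keep = fused[label] >= threshold
    return label, fused, keep

# Toy usage with two classes and 2-D features (all values are made up).
label, fused, keep = aggregate_pseudo_label(
    logits=[2.0, 0.0],
    feature=[1.0, 0.0],
    text_embs=[[1.0, 0.0], [0.0, 1.0]],
    instance_protos=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this sketch an unlabeled sample contributes to training only when `keep` is true, so a sample on which the three levels disagree is more likely to be dropped than to be given a wrong label.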
Problem

Research questions and friction points this paper is trying to address.

Improves semi-supervised facial expression recognition accuracy
Integrates semantic, instance, and text-level information
Generates high-quality pseudo-labels for unlabeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates semantic-, instance-, and text-level information
Generates high-quality pseudo-labels via multi-level probabilities
Uses text embeddings to co-supervise model training
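One way to picture the text co-supervision for labeled data is a loss with two terms: the usual cross-entropy on classifier logits against the one-hot label, plus a cross-entropy on the similarities between the visual feature and each class's text embedding. The sketch below assumes that form. The source does not give the actual loss, so `co_supervised_loss`, `text_weight`, and `temperature` are hypothetical names and values.

```python
import math

def cross_entropy(scores, target):
    """Cross-entropy of unnormalized scores against a single target index."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def co_supervised_loss(logits, feature, text_embs, target,
                       temperature=0.1, text_weight=1.0):
    """One-hot cross-entropy plus an assumed text-alignment term.

    The text term rewards a visual feature for being closer to its own
    class's text embedding than to the other classes' text embeddings.
    """
    ce_onehot = cross_entropy(logits, target)
    text_scores = [cosine(feature, emb) / temperature for emb in text_embs]
    ce_text = cross_entropy(text_scores, target)
    return ce_onehot + text_weight * ce_text

# Toy usage: class 0 is the ground-truth label (all values are made up).
loss = co_supervised_loss(
    logits=[2.0, 0.0],
    feature=[1.0, 0.0],
    text_embs=[[1.0, 0.0], [0.0, 1.0]],
    target=0,
)
```

With the text term included, a feature that classifies correctly but sits far from its class description in the text space is still penalized. This is one reading of "semantic correlations in the text space".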
Sixian Ding
College of Computer Science, Sichuan University, Chengdu 610065, China
Xu Jiang
Duke University
Information economics, accounting standard setting, real effects, disclosure, financial institutions
Zhongjing Du
Sichuan University
Artificial Intelligence, Deep Learning, Facial Expression Recognition
Jiaqi Cui
College of Computer Science, Sichuan University, Chengdu 610065, China
Xinyi Zeng
Sichuan University
Medical Image Segmentation, Medical Image Reconstruction, Multi-modal Learning
Yan Wang
College of Computer Science, Sichuan University, Chengdu 610065, China