🤖 AI Summary
Dynamic facial expression recognition (DFER) faces two key challenges: model bias induced by long-tailed class distributions and the complexity of spatiotemporal feature modeling. To address these, we propose a multi-instance class-aware contrastive learning framework. First, a graph-enhanced instance interaction module models dynamic relationships among video clips via an adaptive adjacency matrix. Second, a weighted instance aggregation network enables importance-aware spatiotemporal feature fusion. Third, a multi-scale class-aware contrastive learning mechanism, which integrates class-aware sampling and multi-scale convolutions, mitigates training imbalance. By combining graph neural networks with attention mechanisms, our method achieves state-of-the-art performance on the DFEW and FERV39k benchmarks. It significantly improves minority-class accuracy, model robustness, and generalization, demonstrating its effectiveness on long-tailed DFER tasks.
📝 Abstract
Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. While existing deep learning-based methods have improved DFER performance, they often fail to address these issues, resulting in severe inductive bias toward the majority categories. To overcome these limitations, we propose a novel multi-instance learning framework called MICACL, which integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization. Specifically, we design the Graph-Enhanced Instance Interaction Module (GEIIM) to capture intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions. To enhance instance-level feature aggregation, we develop the Weighted Instance Aggregation Network (WIAN), which dynamically assigns weights based on instance importance. Furthermore, we introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance training between major and minor categories. Extensive experiments on in-the-wild datasets (i.e., DFEW and FERV39k) demonstrate that MICACL achieves state-of-the-art performance with superior robustness and generalization.
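To make the two instance-level ideas concrete, the following is a minimal NumPy sketch of an adaptive adjacency matrix over clip-level instances followed by importance-weighted aggregation. It is an illustration only, not the GEIIM or WIAN implementation: the abstract does not specify either module's formulation, so the scaled dot-product similarity, the linear scoring vector `w`, the function names, and the tensor shapes are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_adjacency(instances):
    """Build a row-normalized adjacency matrix from instance features.

    instances: array of shape (T, D), one feature vector per video clip.
    Assumption: similarity is a scaled dot product; the paper may use
    a different (learned) formulation.
    """
    d = instances.shape[1]
    scores = instances @ instances.T / np.sqrt(d)
    return softmax(scores, axis=1)


def weighted_aggregate(instances, w):
    """Pool instances into one video-level feature using importance weights.

    w: hypothetical scoring vector of shape (D,), standing in for a
    learned importance scorer.
    """
    weights = softmax(instances @ w, axis=0)  # shape (T,), sums to 1
    return weights, weights @ instances       # pooled feature, shape (D,)


rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 16))      # 8 instances, 16-dim features (toy data)

adj = adaptive_adjacency(clips)       # (8, 8) data-dependent adjacency
mixed = adj @ clips                   # each instance mixes in its neighbors
weights, video_feat = weighted_aggregate(mixed, rng.normal(size=16))
```

In this sketch the adjacency is recomputed from the features of each video rather than fixed in advance, which is one plausible reading of "adaptive"; the aggregation step then lets more informative clips contribute more to the video-level representation.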