Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data

📅 2024-09-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the performance bottleneck of dynamic facial expression recognition (DFER) caused by scarce labeled data, this paper proposes Static-for-Dynamic (S4D), a dual-modal learning framework that systematically leverages abundant static facial expression recognition (SFER) data to enhance DFER, the first systematic application of this idea. The method comprises: (1) dual-modal self-supervised pre-training with a shared Vision Transformer (ViT)-based encoder-decoder architecture; and (2) a Mixture of Adapter Experts (MoAE) module designed to mitigate static-to-dynamic negative transfer while jointly supporting task-specific adaptation and cross-modal knowledge sharing. Evaluated on the FERV39K, MAFW, and DFEW benchmarks, S4D achieves new state-of-the-art weighted average recall (WAR) scores of 53.65%, 58.44%, and 76.68%, respectively. Moreover, it presents a systematic analysis of the semantic correlations between SFER and DFER.

πŸ“ Abstract
Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfactory performance, largely due to having far fewer training samples than SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. Additionally, a systematic correlation analysis between the SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
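The MoAE idea described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the expert count, bottleneck size, shared/private expert split, and the token-wise softmax router are all illustrative assumptions. The core point it demonstrates is how a mixture of small bottleneck adapters can route each task through both shared experts (cross-task knowledge) and a task-private expert (to limit negative transfer), with a residual connection around the mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AdapterExpert:
    """Bottleneck adapter: down-project, ReLU, up-project (hypothetical sizes)."""
    def __init__(self, dim, bottleneck):
        self.down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.up = rng.standard_normal((bottleneck, dim)) * 0.02

    def __call__(self, x):
        return np.maximum(x @ self.down, 0.0) @ self.up

class MoAE:
    """Mixture of Adapter Experts (illustrative sketch, not the paper's exact design).

    Shared experts serve both tasks; each task additionally has a private
    expert, and a task-conditioned router mixes the expert outputs per token.
    """
    def __init__(self, dim, bottleneck=16, n_shared=2, n_tasks=2):
        self.shared = [AdapterExpert(dim, bottleneck) for _ in range(n_shared)]
        self.private = [AdapterExpert(dim, bottleneck) for _ in range(n_tasks)]
        # One router per task over (shared + its private) experts.
        self.routers = [rng.standard_normal((dim, n_shared + 1)) * 0.02
                        for _ in range(n_tasks)]

    def __call__(self, x, task):
        experts = self.shared + [self.private[task]]
        gates = softmax(x @ self.routers[task])            # (tokens, n_experts)
        outs = np.stack([e(x) for e in experts], axis=-1)  # (tokens, dim, n_experts)
        mixed = (outs * gates[:, None, :]).sum(axis=-1)    # gate-weighted mixture
        return x + mixed                                   # residual around the adapters

tokens = rng.standard_normal((4, 64))  # e.g. 4 ViT tokens, hidden size 64
moae = MoAE(dim=64)
print(moae(tokens, task=0).shape)      # task 0 = static (SFER) branch → (4, 64)
```

Because the adapters are the only task-conditioned components, the shared ViT backbone stays common to both modalities while the router decides, per token, how much shared versus private capacity to use.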
Problem

Research questions and friction points this paper is trying to address.

Static Facial Expression Recognition
Dynamic Facial Expression Recognition
Data-limited Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

S4D Framework
MoAE Module
Integrated SFER and DFER
👥 Authors
Yin Chen
Jia Li
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
Yu Zhang
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
Zhenzhen Hu
Hefei University of Technology
Multimedia
Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Meng Wang
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
Richang Hong
Hefei University of Technology
Multimedia, Pattern Recognition