Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data

📅 2024-09-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the performance bottleneck of dynamic facial expression recognition (DFER) caused by scarce labeled data, this paper proposes Static-for-Dynamic (S4D), a dual-modal learning framework that systematically leverages abundant static facial expression recognition (SFER) data to enhance DFER, the first systematic application of this idea. The method comprises: (1) dual-modal self-supervised pre-training with a shared Vision Transformer (ViT)-based encoder-decoder architecture; and (2) a Mixture of Adapter Experts (MoAE) module designed to mitigate static-to-dynamic negative transfer while jointly supporting task-specific adaptation and cross-modal knowledge sharing. Evaluated on the FERV39K, MAFW, and DFEW benchmarks, S4D achieves new state-of-the-art weighted average recall (WAR) scores of 53.65%, 58.44%, and 76.68%, respectively. Moreover, it presents a systematic analysis of the semantic correlations between SFER and DFER.

πŸ“ Abstract
Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfactory performance, largely due to having far fewer training samples than SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. Additionally, a systematic correlation analysis between the SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
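The MoAE idea described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the expert count, bottleneck size, shared/private expert split, and the token-wise softmax router are all illustrative assumptions. The core point it demonstrates is how a mixture of small bottleneck adapters can route each task through both shared experts (cross-task knowledge) and a task-private expert (to limit negative transfer), with a residual connection around the mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AdapterExpert:
    """Bottleneck adapter: down-project, ReLU, up-project (hypothetical sizes)."""
    def __init__(self, dim, bottleneck):
        self.down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.up = rng.standard_normal((bottleneck, dim)) * 0.02

    def __call__(self, x):
        return np.maximum(x @ self.down, 0.0) @ self.up

class MoAE:
    """Mixture of Adapter Experts (illustrative sketch, not the paper's exact design).

    Shared experts serve both tasks; each task additionally has a private
    expert, and a task-conditioned router mixes the expert outputs per token.
    """
    def __init__(self, dim, bottleneck=16, n_shared=2, n_tasks=2):
        self.shared = [AdapterExpert(dim, bottleneck) for _ in range(n_shared)]
        self.private = [AdapterExpert(dim, bottleneck) for _ in range(n_tasks)]
        # One router per task over (shared + its private) experts.
        self.routers = [rng.standard_normal((dim, n_shared + 1)) * 0.02
                        for _ in range(n_tasks)]

    def __call__(self, x, task):
        experts = self.shared + [self.private[task]]
        gates = softmax(x @ self.routers[task])            # (tokens, n_experts)
        outs = np.stack([e(x) for e in experts], axis=-1)  # (tokens, dim, n_experts)
        mixed = (outs * gates[:, None, :]).sum(axis=-1)    # gate-weighted mixture
        return x + mixed                                   # residual around the adapters

tokens = rng.standard_normal((4, 64))  # e.g. 4 ViT tokens, hidden size 64
moae = MoAE(dim=64)
print(moae(tokens, task=0).shape)      # task 0 = static (SFER) branch → (4, 64)
```

Because the adapters are the only task-conditioned components, the shared ViT backbone stays common to both modalities while the router decides, per token, how much shared versus private capacity to use.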
Problem

Research questions and friction points this paper is trying to address.

Static Facial Expression Recognition
Dynamic Facial Expression Recognition
Data-limited Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

S4D Framework
MoAE Module
Integrated SFER and DFER
👥 Authors
Yin Chen
Jia Li
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
Yu Zhang
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
Zhenzhen Hu
Hefei University of Technology
Multimedia
Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Meng Wang
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
Richang Hong
Hefei University of Technology
Multimedia, Pattern Recognition