FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs

📅 2024-07-02
🏛️ ACM Multimedia
📈 Citations: 16
Influential: 0
🤖 AI Summary
To address two key bottlenecks in Dynamic Facial Expression Recognition (DFER), namely insufficient modeling of facial dynamics and ambiguous expression semantics, this paper proposes FineCLIPER, a multi-modal fine-grained CLIP framework. Methodologically, it introduces (i) a hierarchical three-level modeling scheme that mines cues from raw video frames (low semantic level), per-frame face segmentation masks and landmarks (middle level), and MLLM-generated descriptions of facial changes (high level); (ii) cross-modal supervision that extends class labels to textual descriptions of both positive and negative aspects and scores them via CLIP similarity; and (iii) adapter-based Parameter-Efficient Fine-Tuning (PEFT) of the pre-trained CLIP backbone. Evaluated on DFEW, FERV39k, and MAFW, the method achieves state-of-the-art performance in both supervised and zero-shot settings while tuning only a small fraction of the model's parameters, improving dynamic semantic alignment and cross-dataset generalization.
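
The positive/negative description supervision can be pictured in a few lines. The following is a minimal sketch, not the authors' code: the helper name `description_supervision`, the temperature `tau`, and the softplus penalty on negative descriptions are assumptions; the paper specifies only that supervision comes from CLIP-based cross-modal similarity against positive and negative textual descriptions.

```python
import torch
import torch.nn.functional as F

def description_supervision(video_emb, pos_text_emb, neg_text_emb, labels, tau=0.07):
    """video_emb: (B, D) video features; pos_/neg_text_emb: (C, D) one
    embedded description per class; labels: (B,) ground-truth indices."""
    v = F.normalize(video_emb, dim=-1)
    pos = F.normalize(pos_text_emb, dim=-1)
    neg = F.normalize(neg_text_emb, dim=-1)

    # Standard CLIP-style classification against positive descriptions.
    loss_pos = F.cross_entropy(v @ pos.t() / tau, labels)

    # Discourage similarity to the negative description of the true class
    # (the softplus penalty here is an illustrative choice).
    neg_sim = (v @ neg.t() / tau).gather(1, labels[:, None]).squeeze(1)
    loss_neg = F.softplus(neg_sim).mean()
    return loss_pos + loss_neg

# Toy usage: 8 videos, 7 expression classes, 512-dim CLIP space.
v = torch.randn(8, 512); p = torch.randn(7, 512); n = torch.randn(7, 512)
loss = description_supervision(v, p, n, torch.randint(0, 7, (8,)))
```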

📝 Abstract
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the insufficient utilization of facial dynamics, and the ambiguity of expression semantics, etc. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for DFER with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) Our FineCLIPER adopts a hierarchical manner to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic level), we propose to extract the face segmentation masks and landmarks based on each frame (middle semantic level) and utilize the Multi-modal Large Language Model (MLLM) to further generate detailed descriptions of facial changes across frames with designed prompts (high semantic level). Additionally, we also adopt Parameter-Efficient Fine-Tuning (PEFT) to enable efficient adaptation of large pre-trained models (i.e., CLIP) for this task. Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Project page: https://haroldchen19.github.io/FineCLIPER-Page/
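
As a rough picture of the PEFT component mentioned in the abstract, a bottleneck adapter adds a small residual MLP on top of frozen pre-trained features, so only a few parameters receive gradients. This is a generic sketch assuming a standard adapter design; the paper's exact adapter placement inside CLIP and its dimensions are not given here, and `nn.TransformerEncoderLayer` merely stands in for a frozen CLIP block.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Usage: freeze a stand-in backbone block; only the adapter is trainable.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(512)
tokens = torch.randn(2, 16, 512)        # (batch, tokens, dim)
out = adapter(backbone(tokens))         # frozen features, tuned adapter
```

Zero-initializing the up-projection keeps the adapted model identical to the frozen backbone at the start of training, a common trick for stable PEFT.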
Problem

Research questions and friction points this paper is trying to address.

Limited performance in dynamic facial expression recognition
Scarcity of high-quality data and ambiguous expression semantics
Insufficient utilization of facial dynamics in current methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends class labels to positive/negative textual descriptions for cross-modal supervision
Mines hierarchical cues from videos: frames, face masks/landmarks, and MLLM-generated descriptions (a fusion sketch follows this list)
Applies Parameter-Efficient Fine-Tuning to adapt CLIP with few tunable parameters
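
As referenced in the list above, one plausible way to combine the three semantic levels is to pool each into a shared embedding space, then fuse by concatenation and projection. The `HierarchicalFusion` module below is hypothetical; the paper does not commit to this particular fusion design in the material quoted here.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Concatenate pooled features from the three semantic levels and
    project them back to the shared embedding dimension."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, frame_emb, face_emb, text_emb):
        # Each input: (B, dim) pooled features from one semantic level
        # (frames; segmentation masks + landmarks; MLLM descriptions).
        return self.proj(torch.cat([frame_emb, face_emb, text_emb], dim=-1))
```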
Authors

Haodong Chen
School of Automation, Northwestern Polytechnical University

Haojian Huang
The University of Hong Kong

Junhao Dong
Nanyang Technological University

Mingzhe Zheng
Hong Kong University of Science and Technology
Visual Generation · Visual Understanding · Computer Vision · Machine Learning

Dian Shao
Associate Professor, Northwestern Polytechnical University, Xi'an
Computer Vision · Deep Learning · UAV