Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

πŸ“… 2025-11-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the prevalent β€œmulti-frame, single-label” weak-supervision challenge in Dynamic Facial Expression Recognition (DFER), this paper proposes a text-guided weak-supervision framework. Methodologically: (i) it leverages vision-language pre-trained models to inject affective semantic priors; (ii) it introduces a visual prompting mechanism to align textual emotion descriptions with frame-level visual features; and (iii) it constructs a multi-granularity temporal network to jointly model short-term micro-actions and long-term affective evolution. Our key contribution is the first integration of semantic guidance with hierarchical temporal modeling, enabling fine-grained emotion reasoning and frame-level relevance estimation without frame-level annotations. Experiments demonstrate substantial improvements in temporal consistency and generalization across multiple benchmarks, outperforming state-of-the-art multiple-instance learning (MIL) and weak-supervision approaches.

πŸ“ Abstract
Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
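The core idea of text-guided MIL relevance estimation can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: all shapes, the temperature value, and the max-over-classes relevance rule are assumptions, and the real framework uses VLP-derived embeddings rather than random vectors.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_mil(frame_feats, class_text_feats, temperature=0.07):
    """Score each frame against every emotion-text embedding, derive
    frame-level relevance weights, and pool to a video-level prediction.

    frame_feats:      (T, D) per-frame visual embeddings (the MIL "instances")
    class_text_feats: (C, D) text embeddings of emotion descriptions
    Returns (video_logits, frame_relevance).
    """
    v = l2_normalize(frame_feats)
    t = l2_normalize(class_text_feats)
    sim = v @ t.T / temperature              # (T, C) frame-to-class similarities
    # A frame's relevance = how well it matches its best emotion description.
    relevance = softmax(sim.max(axis=1))     # (T,) weights summing to 1
    video_logits = relevance @ sim           # (C,) relevance-weighted pooling
    return video_logits, relevance

# Toy usage: 8 frames, 7 emotion classes, 32-dim embeddings.
rng = np.random.default_rng(0)
logits, rel = text_guided_mil(rng.normal(size=(8, 32)),
                              rng.normal(size=(7, 32)))
print(logits.shape, rel.shape)  # (7,) (8,)
```

The relevance weights make the weak supervision interpretable: frames that align poorly with every emotion description receive low weight and contribute little to the video-level prediction.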
Problem

Research questions and friction points this paper is trying to address.

Addresses visual diversity and temporal complexity in dynamic facial expression recognition
Solves many-to-one labeling problem through text-guided weakly supervised learning
Enhances emotion understanding by aligning visual features with semantic textual descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided weakly supervised framework using vision-language model
Visual prompts align text labels with visual features
Multi-grained temporal network captures short and long-term dynamics
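The multi-grained temporal idea in the last bullet can be illustrated with a toy two-branch fusion. This is a simplified stand-in under stated assumptions (a moving-average window for the short-term branch, a global mean for the long-term branch); the paper's network learns these dynamics rather than hard-coding them.

```python
import numpy as np

def multi_grained_temporal(frame_feats, window=3):
    """Fuse short-term micro-dynamics (local neighborhood average) with
    long-range emotional flow (global context) over a frame sequence.

    frame_feats: (T, D) per-frame embeddings; returns (T, 2*D).
    """
    T, D = frame_feats.shape
    # Short-term branch: average each frame with its local neighborhood,
    # padding at the edges so every frame gets a full window.
    pad = window // 2
    padded = np.pad(frame_feats, ((pad, pad), (0, 0)), mode="edge")
    short = np.stack([padded[i:i + window].mean(axis=0) for i in range(T)])
    # Long-term branch: broadcast the sequence-wide mean to every frame.
    long_ctx = np.broadcast_to(frame_feats.mean(axis=0), (T, D))
    return np.concatenate([short, long_ctx], axis=1)

feats = np.random.default_rng(1).normal(size=(10, 16))
fused = multi_grained_temporal(feats)
print(fused.shape)  # (10, 32)
```

Concatenating the two branches gives each frame both a local view (micro-actions) and a global view (the overall affective trajectory), which is the intuition behind the hierarchical temporal modeling.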
Gunho Jung
Korea University
Heejo Kong
Korea University
Deep Learning · Machine Learning · Physics-informed ML
Seong-Whan Lee
Department of Artificial Intelligence, Korea University, Seoul, 02841, Republic of Korea