Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

πŸ“… 2025-11-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the prevalent β€œmulti-frame, single-label” weak-supervision challenge in Dynamic Facial Expression Recognition (DFER), this paper proposes a text-guided weak-supervision framework. Methodologically: (i) it leverages vision-language pre-trained models to inject affective semantic priors; (ii) it introduces a visual prompting mechanism to align textual emotion descriptions with frame-level visual features; and (iii) it constructs a multi-granularity temporal network to jointly model short-term micro-actions and long-term affective evolution. Our key contribution is the first integration of semantic guidance with hierarchical temporal modeling, enabling fine-grained emotion reasoning and frame-level relevance estimation without frame-level annotations. Experiments demonstrate substantial improvements in temporal consistency and generalization across multiple benchmarks, outperforming state-of-the-art multiple-instance learning (MIL) and weak-supervision approaches.

πŸ“ Abstract
Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
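The core idea of text-guided MIL relevance estimation can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: all shapes, the temperature value, and the max-over-classes relevance rule are assumptions, and the real framework uses VLP-derived embeddings rather than random vectors.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_mil(frame_feats, class_text_feats, temperature=0.07):
    """Score each frame against every emotion-text embedding, derive
    frame-level relevance weights, and pool to a video-level prediction.

    frame_feats:      (T, D) per-frame visual embeddings (the MIL "instances")
    class_text_feats: (C, D) text embeddings of emotion descriptions
    Returns (video_logits, frame_relevance).
    """
    v = l2_normalize(frame_feats)
    t = l2_normalize(class_text_feats)
    sim = v @ t.T / temperature              # (T, C) frame-to-class similarities
    # A frame's relevance = how well it matches its best emotion description.
    relevance = softmax(sim.max(axis=1))     # (T,) weights summing to 1
    video_logits = relevance @ sim           # (C,) relevance-weighted pooling
    return video_logits, relevance

# Toy usage: 8 frames, 7 emotion classes, 32-dim embeddings.
rng = np.random.default_rng(0)
logits, rel = text_guided_mil(rng.normal(size=(8, 32)),
                              rng.normal(size=(7, 32)))
print(logits.shape, rel.shape)  # (7,) (8,)
```

The relevance weights make the weak supervision interpretable: frames that align poorly with every emotion description receive low weight and contribute little to the video-level prediction.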
Problem

Research questions and friction points this paper is trying to address.

Addresses visual diversity and temporal complexity in dynamic facial expression recognition
Solves many-to-one labeling problem through text-guided weakly supervised learning
Enhances emotion understanding by aligning visual features with semantic textual descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided weakly supervised framework using vision-language model
Visual prompts align text labels with visual features
Multi-grained temporal network captures short and long-term dynamics
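The multi-grained temporal idea in the last bullet can be illustrated with a toy two-branch fusion. This is a simplified stand-in under stated assumptions (a moving-average window for the short-term branch, a global mean for the long-term branch); the paper's network learns these dynamics rather than hard-coding them.

```python
import numpy as np

def multi_grained_temporal(frame_feats, window=3):
    """Fuse short-term micro-dynamics (local neighborhood average) with
    long-range emotional flow (global context) over a frame sequence.

    frame_feats: (T, D) per-frame embeddings; returns (T, 2*D).
    """
    T, D = frame_feats.shape
    # Short-term branch: average each frame with its local neighborhood,
    # padding at the edges so every frame gets a full window.
    pad = window // 2
    padded = np.pad(frame_feats, ((pad, pad), (0, 0)), mode="edge")
    short = np.stack([padded[i:i + window].mean(axis=0) for i in range(T)])
    # Long-term branch: broadcast the sequence-wide mean to every frame.
    long_ctx = np.broadcast_to(frame_feats.mean(axis=0), (T, D))
    return np.concatenate([short, long_ctx], axis=1)

feats = np.random.default_rng(1).normal(size=(10, 16))
fused = multi_grained_temporal(feats)
print(fused.shape)  # (10, 32)
```

Concatenating the two branches gives each frame both a local view (micro-actions) and a global view (the overall affective trajectory), which is the intuition behind the hierarchical temporal modeling.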
Gunho Jung
Korea University
Heejo Kong
Korea University
Deep Learning · Machine Learning · Physics-informed ML
Seong-Whan Lee
Department of Artificial Intelligence, Korea University, Seoul, 02841, Republic of Korea