LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses facial expression recognition under real-world variations in pose, occlusion, and illumination by proposing a framework that integrates geometric and semantic priors. A landmark-guided adaptive encoder uses a bi-branch gated cross-attention mechanism to fuse geometric information from facial landmarks with visual appearance features, while a vision-language enhancement strategy draws semantic knowledge from a frozen pretrained CLIP model. Expression-conditioned prompting then adapts the textual features of fixed class-level prompts to each instance. On three benchmark datasets (RAF-DB, FERPlus, and AffectNet), the authors report that the method outperforms state-of-the-art approaches, with improved robustness and generalization.
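To make the gated cross-attention fusion concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the absence of learned projections, and the scalar `gate_bias` standing in for a learned gate are all assumptions made for illustration. Visual tokens attend to landmark tokens, and a sigmoid gate decides per token how much of the landmark-attended signal to mix back into the visual feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Scaled dot-product attention; context tokens act as both keys and values."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(len(q))]
        )
    return attended


def gated_fusion(visual, landmark, gate_bias=0.0):
    """Mix each visual token with its landmark-attended counterpart.

    gate_bias is a hypothetical stand-in for a learned gate parameter.
    """
    attended = cross_attend(visual, landmark)
    scale = math.sqrt(len(visual[0]))
    fused = []
    for v, a in zip(visual, attended):
        # Sigmoid gate in (0, 1): higher values favor the landmark-attended branch.
        gate = 1.0 / (1.0 + math.exp(-(dot(v, a) / scale + gate_bias)))
        fused.append([gate * ad + (1.0 - gate) * vd for vd, ad in zip(v, a)])
    return fused


# Toy inputs: three visual tokens and two landmark tokens, each of dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
landmark_tokens = [[0.5, 0.5], [1.0, -1.0]]
fused_tokens = gated_fusion(visual_tokens, landmark_tokens)
```

Because the gate forms a convex combination, every fused value lies between the original visual value and its landmark-attended value. A trained model would typically compute the gate from learned weights instead of a fixed bias.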

📝 Abstract

Facial Expression Recognition (FER) in the wild remains challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods rely primarily on visual appearance cues and suffer from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) introduces geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which adaptively fuses landmark-based geometric features with visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) uses the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism further adapts the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual and textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets: RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.
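The visual-textual alignment step can be illustrated with a small CLIP-style classification sketch in pure Python. This is an assumption-laden illustration, not the paper's ECP or VLES code: `condition_prompt`, the mixing weight `alpha`, and the temperature value are hypothetical choices, and the feature vectors are toy stand-ins for encoder outputs. Each class prompt feature is nudged toward the image feature, then classes are scored by cosine similarity and a temperature-scaled softmax.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if its norm is zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def condition_prompt(text_feat, image_feat, alpha=0.1):
    """Hypothetical instance conditioning: shift a class text feature toward the image."""
    return [t + alpha * i for t, i in zip(text_feat, image_feat)]


def classify(image_feat, class_text_feats, alpha=0.1, temperature=0.07):
    """Return class probabilities from cosine similarity to conditioned prompts."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in class_text_feats:
        conditioned = condition_prompt(l2_normalize(text_feat), img, alpha)
        txt = l2_normalize(conditioned)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Stable softmax over the temperature-scaled similarities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features: one image vector and three class prompt vectors of dimension 3.
image_feature = [0.9, 0.1, 0.0]
prompt_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probabilities = classify(image_feature, prompt_features)
```

In this simplified form the same image-dependent shift is applied to every class prompt, so it does not change which class ranks highest. A learned conditioning module would produce class-specific adjustments.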

Problem

Research questions and friction points this paper is trying to address.

Facial Expression Recognition

in-the-wild

attention redundancy

occlusion

illumination variation

Innovation

Methods, ideas, or system contributions that make the work stand out.

landmark-guided

contrastive learning

vision-language model