TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of existing vision-language models (VLMs) to hallucinations in facial expression recognition, stemming from a disconnect between reasoning and visual evidence, as well as their limited robustness across datasets. To mitigate these issues, the authors propose TAG, a novel framework that introduces Facial Action Units (AUs) as structured, verifiable intermediate representations within multimodal reasoning, explicitly anchoring intermediate inference steps to AU-relevant visual regions. TAG integrates AU-region-supervised fine-tuning with reinforcement learning guided by an AU-aware reward mechanism, thereby enhancing visual faithfulness and reducing hallucinatory outputs. Experimental results demonstrate that TAG consistently outperforms both open-source and closed-source VLM baselines on RAF-DB, FERPlus, and AffectNet, significantly improving reasoning interpretability and cross-dataset generalization.

📝 Abstract
Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG.
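The abstract's AU-aware reward could plausibly take a form like the sketch below, which is not the authors' implementation: it combines expression-label correctness with the overlap between the facial regions the model cites for each Action Unit and the regions reported by an external AU detector. The function names, box format, and the `alpha` weighting are all assumptions for illustration.

```python
# Hypothetical sketch of an AU-aware reward (NOT the paper's actual code).
# Idea: reward = label correctness + mean IoU between the model's cited
# AU regions and the regions an external AU detector reports, so that
# hallucinated (undetected) AUs earn zero grounding credit.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def au_aware_reward(pred_label, gold_label,
                    pred_regions, detector_regions, alpha=0.5):
    """Blend expression correctness with AU-region grounding.

    pred_regions / detector_regions: dicts mapping AU ids (e.g. "AU12",
    lip corner puller) to (x1, y1, x2, y2) boxes. AUs the model cites
    but the detector does not report contribute zero grounding score.
    """
    correct = 1.0 if pred_label == gold_label else 0.0
    if pred_regions:
        ground = sum(
            iou(box, detector_regions[au]) if au in detector_regions else 0.0
            for au, box in pred_regions.items()
        ) / len(pred_regions)
    else:
        ground = 0.0  # no cited evidence -> no grounding credit
    return alpha * correct + (1 - alpha) * ground
```

Under this framing, a fluent rationale that names AUs the detector never fired on is penalized directly, which matches the abstract's claim that the reward improves visual faithfulness.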
Problem

Research questions and friction points this paper is trying to address.

Facial Expression Recognition
Vision-Language Models
Hallucination
Action Units
Visual Grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action Unit Grounding
Vision-Language Model
Facial Expression Recognition
Multimodal Reasoning
Reinforcement Learning