Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two critical limitations of vision-language models (VLMs) in facial emotion analysis (FEA): (1) hallucinated reasoning due to insufficient emotion knowledge, and (2) reasoning–recognition misalignment caused by weak associations between facial features and emotion labels. To resolve these, we propose a three-stage alignment framework that jointly models emotion recognition, action unit (AU) detection, and AU-driven interpretable reasoning. We explicitly align reasoning paths with AU and emotion labels via reinforcement learning and introduce a self-iterative synthetic data pipeline for continual optimization under minimal supervision. Our method integrates instruction tuning, label-guided reinforcement training, VLM architecture adaptation, and synthetic data augmentation. It achieves state-of-the-art performance across eight FEA benchmarks and introduces FEA-20K, a new dataset of 19,425 samples (17,737 training, 1,688 test), significantly improving generalization and interpretability.

📝 Abstract
Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
Problem

Research questions and friction points this paper is trying to address.

Addresses hallucinated reasoning in emotion analysis models
Solves misalignment between emotion reasoning and recognition
Improves facial emotion analysis with minimal supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction fine-tuning establishes basic emotional reasoning capability
Reinforcement training aligns reasoning process with emotion prediction
Iterative data synthesis pipeline enables scalable self-improvement
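
The second innovation, using emotion and AU labels as reward signals, can be sketched as a simple reward function. This is a hypothetical illustration of the idea, not the authors' implementation: function names, the F1-based AU overlap term, and the equal weighting are assumptions.

```python
# Hypothetical sketch of a label-guided alignment reward: the model is
# rewarded only when its predicted emotion AND the AUs cited in its
# reasoning both agree with the gold labels, discouraging reasoning
# chains that are plausible but unsupported by the face.

def au_f1(predicted_aus: set, gold_aus: set) -> float:
    """F1 overlap between AUs cited in the reasoning and gold AU labels."""
    if not predicted_aus and not gold_aus:
        return 1.0
    if not predicted_aus or not gold_aus:
        return 0.0
    tp = len(predicted_aus & gold_aus)
    precision = tp / len(predicted_aus)
    recall = tp / len(gold_aus)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def alignment_reward(pred_emotion: str, gold_emotion: str,
                     predicted_aus: set, gold_aus: set,
                     w_emotion: float = 0.5, w_au: float = 0.5) -> float:
    """Combined reward: emotion-label match plus AU-reasoning overlap."""
    emotion_score = 1.0 if pred_emotion == gold_emotion else 0.0
    return w_emotion * emotion_score + w_au * au_f1(predicted_aus, gold_aus)
```

For example, a response that predicts "happiness" and cites AU6 and AU12 (matching the gold annotation) earns the full reward of 1.0, while one with the wrong emotion and only partial AU overlap earns far less.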
Jiulong Wu
School of Computer Science and Technology, Soochow University, Suzhou, China
Yucheng Shen
School of Computer Science and Technology, Soochow University, Suzhou, China
Lingyong Yan
Baidu Inc.
Large Language Model, Machine Learning
Haixin Sun
School of Computer Science and Technology, Soochow University, Suzhou, China
Deguo Xia
Baidu Inc.
Generative AI, Multimodal Learning, Computer Vision
Jizhou Huang
Baidu Inc.
Generative AI, Data Mining, Natural Language Processing
Min Cao
School of Computer Science and Technology, Soochow University, Suzhou, China