🤖 AI Summary
Multimodal large language models (MLLMs) show weak reasoning and poor interpretability on facial expression recognition (FER). Method: We introduce FERBench, a systematic benchmark evaluating 20 state-of-the-art MLLMs across four major FER datasets, and adopt a unified visual question answering (VQA) paradigm for FER. We release two large-scale datasets: UniFER-CoT-230K for cold-start chain-of-thought initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building on these, we develop UniFER-7B, an interpretable FER foundation model trained via chain-of-thought cold start followed by RLVR-based post-training. Results: UniFER-7B outperforms both open-source and closed-source general-purpose MLLMs (e.g., Gemini-2.5-Pro, Qwen2.5-VL-72B) on multiple FER benchmarks, delivering strong classification accuracy together with human-verifiable reasoning.
📝 Abstract
Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models toward more unified approaches. One promising avenue for unifying FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs across many tasks, their performance on FER remains largely unexplored. To address this gap, we present FERBench, a systematic benchmark that evaluates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs achieve good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality, large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building upon them, we develop a unified and interpretable FER foundation model, UniFER-7B, which outperforms many open-source and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
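To make the two central ideas concrete, here is a minimal sketch of (a) wrapping a conventional FER sample as a VQA record and (b) a binary verifiable reward of the kind used in RLVR-style training. The field names, prompt wording, and seven-emotion label set are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch only: field names, prompt text, and the emotion
# vocabulary below are assumptions, not UniFER's actual data format.

EMOTIONS = ["happiness", "sadness", "anger", "fear",
            "surprise", "disgust", "neutral"]

def fer_to_vqa(image_path: str, label: str) -> dict:
    """Wrap an (image, label) FER pair as a multiple-choice VQA example."""
    choices = ", ".join(EMOTIONS)
    return {
        "image": image_path,
        "question": f"What facial expression is shown? Choose one of: {choices}.",
        "answer": label,
    }

def verifiable_reward(model_answer: str, gold_label: str) -> float:
    """Binary verifiable reward: 1.0 if the predicted emotion matches the
    ground-truth label (case-insensitive), else 0.0."""
    return 1.0 if model_answer.strip().lower() == gold_label.strip().lower() else 0.0

sample = fer_to_vqa("img_001.jpg", "happiness")
print(verifiable_reward("Happiness", sample["answer"]))  # → 1.0
```

Because the reward is computed by exact label matching rather than by a learned judge, it is cheaply verifiable at scale, which is what makes the RLVR stage feasible over hundreds of thousands of samples.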