🤖 AI Summary
Multimodal large language models (MLLMs) show weak reasoning and poor interpretability on facial expression recognition (FER). Method: We introduce FERBench, a systematic benchmark evaluating 20 state-of-the-art MLLMs across four major FER datasets, and adopt a unified visual question answering (VQA) paradigm for FER. We release two large-scale datasets: UniFER-CoT-230K for cold-start chain-of-thought initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building on these, we develop UniFER-7B, an interpretable FER foundation model trained via chain-of-thought cold start followed by RLVR-based post-training. Results: UniFER-7B outperforms both open-source and closed-source general-purpose MLLMs (e.g., Gemini-2.5-Pro, Qwen2.5-VL-72B) on multiple FER benchmarks, delivering strong classification accuracy together with human-verifiable reasoning.
📝 Abstract
Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models toward more unified approaches. One promising avenue for unifying FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs across many tasks, their performance on FER remains largely unexplored. To address this gap, we present FERBench, a systematic benchmark that evaluates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs achieve good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality, large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building upon them, we develop a unified and interpretable FER foundation model, UniFER-7B, which outperforms many open-source and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
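To make the two central ideas concrete, here is a minimal sketch of (a) wrapping a conventional FER sample as a VQA record and (b) a binary verifiable reward of the kind used in RLVR-style training. The field names, prompt wording, and seven-emotion label set are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch only: field names, prompt text, and the emotion
# vocabulary below are assumptions, not UniFER's actual data format.

EMOTIONS = ["happiness", "sadness", "anger", "fear",
            "surprise", "disgust", "neutral"]

def fer_to_vqa(image_path: str, label: str) -> dict:
    """Wrap an (image, label) FER pair as a multiple-choice VQA example."""
    choices = ", ".join(EMOTIONS)
    return {
        "image": image_path,
        "question": f"What facial expression is shown? Choose one of: {choices}.",
        "answer": label,
    }

def verifiable_reward(model_answer: str, gold_label: str) -> float:
    """Binary verifiable reward: 1.0 if the predicted emotion matches the
    ground-truth label (case-insensitive), else 0.0."""
    return 1.0 if model_answer.strip().lower() == gold_label.strip().lower() else 0.0

sample = fer_to_vqa("img_001.jpg", "happiness")
print(verifiable_reward("Happiness", sample["answer"]))  # → 1.0
```

Because the reward is computed by exact label matching rather than by a learned judge, it is cheaply verifiable at scale, which is what makes the RLVR stage feasible over hundreds of thousands of samples.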