An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images

📅 2025-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the poor cross-scenario generalization of facial expression recognition (FER), this paper evaluates a visual question answering (VQA) strategy for the zero-shot setting: FER is reformulated as a vision-language model's answer to a predefined question about the facial expression in an image, so no task-specific classification head is trained. The approach relies on locally executed vision-language models (VLMs), which bring language-model knowledge to the visual task and allow zero-shot transfer. The paper evaluates a broad collection of such VLMs on three standard benchmarks (AffectNet, FERPlus, and RAF-DB) and compares them with state-of-the-art FER models, both with and without VLM components. Some VLMs perform very well in the zero-shot setting, which supports the VQA strategy as a route to better FER generalization.
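The VQA reformulation described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the label set, the question wording, and the `ask_vlm` callable (a stand-in for whatever locally executed VLM is used) are all assumptions made for illustration. The sketch only shows the general shape of the idea: ask a fixed question that lists the candidate expressions, then map the model's free-text answer back to a label.

```python
import re
from typing import Callable, Optional, Sequence

# Assumed label set for illustration; the benchmarks in the paper
# may use a different or larger set of expression categories.
LABELS = ("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral")


def build_question(labels: Sequence[str] = LABELS) -> str:
    """Build a closed-set question that lists the candidate expressions."""
    options = ", ".join(labels)
    return (
        "What facial expression does the person in the image show? "
        f"Answer with exactly one word from: {options}."
    )


def parse_answer(answer: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Return the first candidate label mentioned in the answer, or None."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r")\b")
    match = pattern.search(answer.lower())
    return match.group(1) if match else None


def classify(image_path: str, ask_vlm: Callable[[str, str], str]) -> Optional[str]:
    """Zero-shot FER as VQA: ask the question, then parse the reply.

    `ask_vlm(image_path, question)` is a hypothetical callable that wraps
    a vision-language model and returns its text answer.
    """
    return parse_answer(ask_vlm(image_path, build_question()))
```

Because the model is only queried and never fine-tuned, swapping in a different VLM means changing only the `ask_vlm` callable. Answers that mention none of the labels return `None` and would need a fallback policy in a real evaluation.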

📝 Abstract

Facial expression recognition (FER) is a key research area in computer vision and human-computer interaction. Despite recent advances in deep learning, challenges persist, especially in generalizing to new scenarios. In fact, the performance of state-of-the-art FER models drops significantly in zero-shot settings. To address this problem, the community has recently started to explore the integration of knowledge from Large Language Models for visual tasks. In this work, we evaluate a broad collection of locally executed Visual Language Models (VLMs), compensating for their lack of task-specific knowledge by adopting a Visual Question Answering strategy. We compare the proposed pipeline with state-of-the-art FER models, both with and without VLMs, on three well-known FER benchmarks: AffectNet, FERPlus, and RAF-DB. The results show excellent performance for some VLMs in zero-shot FER scenarios, indicating the need for further exploration to improve FER generalization.

Problem

Research questions and friction points this paper is trying to address.

Evaluating zero-shot facial expression recognition in images

Addressing generalization challenges in FER using VLMs

Comparing VLM-based FER pipelines with state-of-the-art models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a Visual Question Answering strategy

Integrates locally executed VLMs

Evaluates zero-shot FER on the AffectNet, FERPlus, and RAF-DB benchmarks
