An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images

📅 2025-04-30
🤖 AI Summary
To address the poor cross-scenario generalization of facial expression recognition (FER) models, this paper evaluates a visual question answering (VQA) strategy for zero-shot FER: rather than training a conventional classification head, a vision-language model (VLM) is asked a predefined question about the facial expression in the image, and its answer is taken as the prediction. The evaluation covers a broad collection of lightweight, locally deployable VLMs on three standard benchmarks (AffectNet, FERPlus, and RAF-DB) and compares them with state-of-the-art FER models, both with and without VLM components. Some VLMs perform very well in the zero-shot setting, which suggests that the VQA strategy merits further study as a route to better FER generalization.

📝 Abstract
Facial expression recognition (FER) is a key research area in computer vision and human-computer interaction. Despite recent advances in deep learning, challenges persist, especially in generalizing to new scenarios. In fact, the performance of state-of-the-art FER models drops significantly in zero-shot settings. To address this problem, the community has recently started to explore integrating knowledge from Large Language Models into visual tasks. In this work, we evaluate a broad collection of locally executed Visual Language Models (VLMs), working around their lack of task-specific knowledge by adopting a Visual Question Answering strategy. We compare the proposed pipeline with state-of-the-art FER models, both with and without VLMs, on well-known FER benchmarks: AffectNet, FERPlus, and RAF-DB. The results show excellent performance for some VLMs in zero-shot FER scenarios, indicating that further exploration of this approach is warranted to improve FER generalization.

Problem

Research questions and friction points this paper is trying to address.

Evaluating zero-shot facial expression recognition in images
Addressing generalization challenges in FER using VLMs
Comparing VLM-based FER pipelines with state-of-the-art models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Visual Question Answering strategy
Integrates locally executed VLMs
Evaluates zero-shot FER on the AffectNet, FERPlus, and RAF-DB benchmarks
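
The VQA strategy listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: the label set, the question wording, and the `ask_vlm` callable (a stand-in for whatever locally executed VLM is used) are all assumptions made for illustration. The sketch shows only the general idea of posing a fixed question and mapping the model's free-text answer back to an expression label, with no classification head involved.

```python
from typing import Callable, Optional, Sequence

# Hypothetical label set; each benchmark defines its own categories.
LABELS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral")


def build_question(labels: Sequence[str] = LABELS) -> str:
    """Compose a fixed question that lists the allowed answers."""
    options = ", ".join(labels)
    return (
        "What is the facial expression of the person in the image? "
        f"Answer with one word from: {options}."
    )


def parse_answer(answer: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Return the label mentioned earliest in the answer, or None if none appears."""
    text = answer.lower()
    hits = [(text.find(label), label) for label in labels if label in text]
    return min(hits)[1] if hits else None


def classify(image_path: str, ask_vlm: Callable[[str, str], str]) -> Optional[str]:
    """Zero-shot prediction: ask the question, then parse the reply.

    ask_vlm(image_path, question) is an assumed interface that wraps a VLM
    and returns its text answer.
    """
    return parse_answer(ask_vlm(image_path, build_question()))
```

With a stub in place of a real model, `classify("face.jpg", lambda img, q: "Happiness")` returns `"happiness"`. Returning `None` for unparseable answers keeps such cases visible, so they can be counted as errors rather than silently assigned to a class.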

Authors

Modesto Castrillón-Santana
SIANI - Universidad de Las Palmas de Gran Canaria, Spain
Oliverio J. Santana
Departamento de Informática y Sistemas, Universidad de Las Palmas de Gran Canaria
David Freire-Obregón
SIANI - Universidad de Las Palmas de Gran Canaria, Spain
Daniel Hernández-Sosa
SIANI - Universidad de Las Palmas de Gran Canaria, Spain
J. Lorenzo-Navarro
SIANI - Universidad de Las Palmas de Gran Canaria, Spain