FaceXBench: Evaluating Multimodal LLMs on Face Understanding

📅 2025-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) lack systematic evaluation for face understanding tasks. Method: We introduce FaceXBench, a comprehensive benchmark comprising 5,000 multimodal multiple-choice questions spanning 14 face understanding tasks across six broad categories: bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Questions are derived from 25 public datasets plus a newly created dataset, FaceXAPI, and models are compared under three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Contribution/Results: Evaluating 28 MLLMs (26 open-source and 2 proprietary, including GPT-4o and GeminiPro 1.5), we reveal the unique challenges of complex face understanding tasks and show that even advanced models leave significant room for improvement, exposing critical limitations in current MLLMs' face understanding capacity.

📝 Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs' face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o, and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: https://github.com/Kartik-3004/facexbench
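The abstract describes scoring MLLMs on multiple-choice questions grouped into broad categories. A minimal harness for that kind of evaluation might look like the sketch below; the question schema, category names, and the `model` callable are illustrative assumptions, not the benchmark's actual API (see the linked repository for the real evaluation code).

```python
from collections import defaultdict

def evaluate(model, questions):
    """Score multiple-choice accuracy overall and per category.

    Assumptions (hypothetical interface):
      - `model(image, question, choices)` returns a choice letter such as "A".
      - each item in `questions` is a dict with keys
        `image`, `question`, `choices`, `answer`, and `category`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        pred = model(q["image"], q["question"], q["choices"])
        total[q["category"]] += 1
        if pred == q["answer"]:
            correct[q["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category

if __name__ == "__main__":
    # Toy example: a "model" that always answers "A".
    always_a = lambda image, question, choices: "A"
    qs = [
        {"image": None, "question": "Are these the same person?",
         "choices": ["A. Yes", "B. No"], "answer": "A",
         "category": "Face Recognition"},
        {"image": None, "question": "Is this face spoofed?",
         "choices": ["A. Yes", "B. No"], "answer": "B",
         "category": "Face Authentication"},
    ]
    overall, per_cat = evaluate(always_a, qs)
    print(overall)   # 0.5
    print(per_cat)   # {'Face Recognition': 1.0, 'Face Authentication': 0.0}
```

Per-category accuracy matters here because an aggregate score can hide large gaps between, say, recognition and localization tasks, which is exactly the kind of breakdown the benchmark's six categories are designed to expose.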
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Facial Image Recognition
Performance Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FaceXBench
AI Facial Recognition
Performance Evaluation