🤖 AI Summary
Multimodal large language models (MLLMs) lack systematic evaluation for face understanding. Method: We introduce FaceXBench, a comprehensive benchmark of 5,000 multimodal multiple-choice questions drawn from 25 public datasets plus a newly created dataset, FaceXAPI. The questions span 14 tasks across 6 broad categories: bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Contribution/Results: We evaluate 26 open-source MLLMs and 2 proprietary models (GPT-4o, Gemini Pro 1.5) under three settings: zero-shot, in-context task description, and chain-of-thought prompting. The evaluation reveals the unique challenges of complex face understanding: even the most advanced models show significant room for improvement, exposing critical limitations in current MLLMs' facial understanding capacity.
📝 Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs' face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o and Gemini Pro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: https://github.com/Kartik-3004/facexbench
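The abstract describes three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting, with accuracy measured over multiple-choice questions. A minimal sketch of how such prompts might be assembled and scored is below; the template wording, function names, and option-letter convention are illustrative assumptions, not FaceXBench's actual harness.

```python
# Hypothetical sketch of the three evaluation settings described in the
# abstract. Prompt templates and helper names are illustrative assumptions.

def build_prompt(question, options, setting="zero-shot", task_description=None):
    """Format one multiple-choice question for an MLLM under a given setting."""
    # Label options A, B, C, ... as is common in multiple-choice benchmarks.
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    body = f"{question}\n{lettered}\n"
    if setting == "zero-shot":
        return body + "Answer with the option letter only."
    if setting == "in-context":
        # Prepend a description of the task, per the in-context setting.
        return f"Task: {task_description}\n\n{body}Answer with the option letter only."
    if setting == "chain-of-thought":
        return body + "Think step by step, then state the final option letter."
    raise ValueError(f"unknown setting: {setting}")

def accuracy(predicted_letters, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predicted_letters, answer_key))
    return correct / len(answer_key)
```

In a real harness, `build_prompt` would be paired with the face image(s) in a multimodal API call, and the model's free-form reply would be parsed back to a single option letter before scoring.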