🤖 AI Summary
This study addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on fine-grained facial perception. To this end, we introduce FaceBench, the first multi-view, multi-level visual question-answering (VQA) benchmark designed specifically for facial perception, comprising five perceptual views, up to three hierarchical levels, and over 210 attributes, with 49,919 evaluation samples and 23,841 fine-tuning samples. The benchmark is built on a hierarchical facial attribute structure, and model results are compared directly against human performance. Leveraging the fine-tuning data, we train Face-LLaVA, a robust MLLM baseline that significantly outperforms leading open-source MLLMs using only a small amount of training data and is comparable to commercial models such as GPT-4o and Gemini on facial attribute understanding. Our analysis further uncovers systematic limitations of current MLLMs in fine-grained facial perception.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across a variety of tasks. However, evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. We first construct a hierarchical facial attribute structure that encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Built on this structure, FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. We further develop a robust face perception MLLM baseline, Face-LLaVA, by training on our face VQA data. Extensive experiments on mainstream MLLMs and Face-LLaVA assess their face perception ability, with results also compared against human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, whereas our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones such as GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.
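To make the multi-view, multi-level VQA setup above concrete, here is a minimal sketch of how such hierarchical attribute VQA pairs could be represented and scored. All field names, attribute labels, and example questions below are hypothetical illustrations, not taken from the released FaceBench dataset.

```python
from dataclasses import dataclass

@dataclass
class VQAPair:
    """One hypothetical benchmark item: a question tied to a facial attribute."""
    view: str        # one of the five perceptual views (illustrative label)
    level: int       # depth of the attribute in the hierarchy (1-3)
    attribute: str   # fine-grained attribute, e.g. "eyebrow shape"
    question: str
    choices: list[str]
    answer: str      # ground-truth attribute value

def accuracy(pairs: list[VQAPair], predictions: list[str]) -> float:
    """Fraction of items whose predicted choice matches the ground truth."""
    correct = sum(p.answer == pred for p, pred in zip(pairs, predictions))
    return correct / len(pairs)

# Two made-up items to show the structure.
pairs = [
    VQAPair("appearance", 2, "eyebrow shape",
            "What is the shape of the person's eyebrows?",
            ["arched", "straight", "curved"], "arched"),
    VQAPair("expression", 1, "emotion",
            "What emotion is the person showing?",
            ["happy", "sad", "neutral"], "happy"),
]

print(accuracy(pairs, ["arched", "neutral"]))  # one of two correct -> 0.5
```

Grouping items by `view` and `level` in this way would let per-view and per-level accuracies be reported separately, which is the kind of fine-grained breakdown a hierarchical benchmark enables.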