🤖 AI Summary
This study addresses the limited reliability of facial comparison explanations generated by multimodal large language models (MLLMs), which often contain unverifiable or hallucinated content when applied to unconstrained images, such as those with extreme poses or surveillance footage. The work presents the first systematic evaluation of explanation faithfulness for MLLMs on the IJB-S dataset, integrating decision scores and outputs from traditional face recognition systems to enhance explanation quality. It further introduces a likelihood-ratio-based framework to quantify evidential strength, offering a novel metric for explanation credibility beyond mere decision accuracy. Findings reveal that even when MLLMs make correct judgments, their explanations frequently rely on attributes unsupported by visual evidence. Although incorporating traditional system information improves classification performance, it does not substantially increase explanation faithfulness, exposing fundamental limitations of current MLLMs in explainable biometrics.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, namely scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.
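The paper does not spell out its likelihood-ratio formulation here, but the general idea of scoring evidential strength with likelihood ratios can be sketched as follows. In this hypothetical illustration, each attribute claim in an explanation is assigned estimated probabilities under the "same person" and "different person" hypotheses; the function names and the per-claim probabilities are assumptions for illustration, not the authors' method.

```python
import math

def log_likelihood_ratio(p_same: float, p_diff: float) -> float:
    """Log-LR of one piece of textual evidence: how much more likely
    the claim would be if the two images show the same person
    than if they show different people."""
    return math.log(p_same) - math.log(p_diff)

def explanation_strength(claims) -> float:
    """Aggregate evidential strength of an explanation, assuming
    conditionally independent claims: the sum of per-claim log-LRs.
    Positive values support 'same person', negative values 'different
    person', and values near zero indicate uninformative evidence."""
    return sum(log_likelihood_ratio(p_same, p_diff)
               for p_same, p_diff in claims)

# Hypothetical probabilities for three attribute claims in an explanation:
claims = [
    (0.9, 0.3),  # verifiable attribute match: supports "same person"
    (0.5, 0.5),  # unverifiable/hallucinated claim: carries no evidence
    (0.2, 0.6),  # mismatched attribute: supports "different person"
]
strength = explanation_strength(claims)
```

Under this reading, a hallucinated attribute contributes a log-LR near zero, so an explanation built mostly on unsupported claims has low evidential strength even when the final verification decision is correct.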