AI Summary
This study systematically evaluates the performance gap between general-purpose multimodal foundation models (CLIP, BLIP, LLaVA, DINO) and domain-specific face recognition models (ArcFace, AdaFace) on face verification. It identifies that foundation models exhibit superior robustness in fine-grained scenarios, e.g., occluded or over-segmented faces. To leverage this, we propose a score-level fusion framework integrating zero-shot inference with context-aware weighting. Furthermore, we incorporate a prompt-driven large language model to generate natural-language explanations, enhancing decision interpretability and enabling the correction of misclassifications. Experiments show that while specialized models remain superior on standard benchmarks, our fusion method significantly improves accuracy under stringent false-positive constraints: on IJB-B, the true match rate reaches 83.31% (+10.67%). Our core contributions are: (i) revealing the fine-grained robustness advantage of foundation models for face verification; (ii) designing an interpretable, cross-paradigm fusion architecture; and (iii) empirically validating its efficacy in high-reliability operational settings.
Abstract
In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVA, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improves on over-segmented face images compared to tightly cropped faces, thereby suggesting the importance of contextual cues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250, while the TMR of the domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., ``Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour...''). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., ``Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity...and the pair is likely of the same person''), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.
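Finding (c) refers to a simple score-level fusion of a foundation model's similarity scores with those of a domain-specific FR model. A minimal sketch of one common way to do this is shown below, assuming min-max normalization over a batch of comparison scores and fixed equal weights; the function name, the normalization choice, and the weights are illustrative assumptions, not the paper's fitted configuration:

```python
import numpy as np

def score_level_fusion(fr_scores, fm_scores, w_fr=0.5, w_fm=0.5):
    """Fuse similarity scores from a domain-specific FR model (e.g., AdaFace)
    with those from a foundation model (e.g., BLIP).

    Each score array covers the same batch of image pairs. Scores are
    min-max normalized to [0, 1] so the two models' different score ranges
    become comparable, then combined as a weighted sum. The equal default
    weights are an assumption for illustration only.
    """
    def minmax(scores):
        s = np.asarray(scores, dtype=float)
        rng = s.max() - s.min()
        # Degenerate batch (all scores identical): return zeros.
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    return w_fr * minmax(fr_scores) + w_fm * minmax(fm_scores)
```

A verification decision can then be made by thresholding the fused score, with the threshold chosen on a validation set to meet the target FMR.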