AI Summary
This study systematically evaluates the performance gap between general-purpose multimodal foundation models (CLIP, BLIP, LLaVA, DINO) and domain-specific face recognition models (ArcFace, AdaFace) on face verification. It identifies that foundation models exhibit superior robustness in fine-grained scenarios, e.g., occluded or over-segmented faces. To leverage this, we propose a score-level fusion framework integrating zero-shot inference with context-aware weighting. Furthermore, we incorporate a prompt-driven large language model to generate natural-language explanations, enhancing decision interpretability and enabling the correction of misclassifications. Experiments show that while specialized models remain superior on standard benchmarks, our fusion method significantly improves accuracy under stringent false-positive constraints: on IJB-B, the true match rate reaches 83.31% (+10.67%). Our core contributions are: (i) revealing the fine-grained robustness advantage of foundation models for face verification; (ii) designing an interpretable, cross-paradigm fusion architecture; and (iii) empirically validating its efficacy in high-reliability operational settings.
Abstract
In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVA, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improves on over-segmented face images compared to tightly cropped faces, thereby suggesting the importance of contextual cues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250, while the TMR of the domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., ``Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour...''). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., ``Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity...and the pair is likely of the same person''), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.
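Finding (c) refers to a simple score-level fusion of a foundation model's similarity scores with those of a domain-specific FR model. A minimal sketch of one common way to do this is shown below, assuming min-max normalization over a batch of comparison scores and fixed equal weights; the function name, the normalization choice, and the weights are illustrative assumptions, not the paper's fitted configuration:

```python
import numpy as np

def score_level_fusion(fr_scores, fm_scores, w_fr=0.5, w_fm=0.5):
    """Fuse similarity scores from a domain-specific FR model (e.g., AdaFace)
    with those from a foundation model (e.g., BLIP).

    Each score array covers the same batch of image pairs. Scores are
    min-max normalized to [0, 1] so the two models' different score ranges
    become comparable, then combined as a weighted sum. The equal default
    weights are an assumption for illustration only.
    """
    def minmax(scores):
        s = np.asarray(scores, dtype=float)
        rng = s.max() - s.min()
        # Degenerate batch (all scores identical): return zeros.
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    return w_fr * minmax(fr_scores) + w_fm * minmax(fm_scores)
```

A verification decision can then be made by thresholding the fused score, with the threshold chosen on a validation set to meet the target FMR.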