🤖 AI Summary
This work addresses the lack of decision transparency in existing face verification systems, which struggle to provide trustworthy explanations. The authors propose a vision-language model that both determines whether two facial images belong to the same identity and generates a natural language explanation, either concise or detailed. The approach applies cross-modal transfer learning to explainable face verification, trains on two complementary explanation styles, and combines a state-of-the-art vision-language architecture with a visual reasoning mechanism adapted from audio difference modeling. Experimental results indicate that the proposed model outperforms current baselines in both verification accuracy and explanation quality, highlighting the potential of vision-language models for interpretable face verification.
📝 Abstract
Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce a Vision-Language Model (VLM) for face verification, which not only determines whether two face images depict the same individual but also explicitly explains the rationale behind its decision. Our model is trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation so that it suits visual inputs. This cross-modal transfer significantly improves our model's accuracy and interpretability. The proposed VLM integrates feature extraction with reasoning capabilities, enabling it to articulate its verification process clearly. Our approach surpasses baseline methods and existing models. These findings highlight the potential of vision-language models in the face verification setting, contributing to more transparent, reliable, and explainable face verification systems.