🤖 AI Summary
This work addresses the limited interpretability of deep models in face verification by proposing the first *intrinsically interpretable* deep neural network framework, eliminating reliance on post-hoc explanation methods. Methodologically, it introduces local matching modules with constrained receptive fields (28×28–56×56) that enable *additive decomposition* of global similarity over 112×112 input images, so the final decision is a linear aggregation of region-wise facial contributions. Key contributions include: (i) the first locally additive, inherently interpretable similarity metric requiring no post-hoc intervention; (ii) strong discriminative power even at the smallest patch size (28×28); and (iii) state-of-the-art performance on major benchmarks using 56×56 patches. The design substantially enhances model transparency and decision trustworthiness, establishing a new paradigm for trustworthy face verification.
📝 Abstract
Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model's actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28×28 within 112×112 face images, and surpasses state-of-the-art methods when using 56×56 patches.
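The core idea of the abstract can be sketched as follows: the global similarity between two faces is the plain sum of per-patch similarity scores, so each facial region's contribution to the decision is directly readable. This is a minimal illustrative sketch only; the function names are hypothetical, and the actual method compares learned local descriptors from matching modules with constrained receptive fields, not raw pixel patches as done here.

```python
import numpy as np

def patch_similarity_map(face_a, face_b, patch=28):
    """Decompose the similarity of two aligned 112x112 faces into
    additive patch-level scores (illustrative; the paper compares
    learned local features, not raw pixels)."""
    h, w = face_a.shape[:2]
    scores = {}
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            pa = face_a[i:i + patch, j:j + patch].ravel()
            pb = face_b[i:i + patch, j:j + patch].ravel()
            # cosine similarity between the two local descriptors
            denom = np.linalg.norm(pa) * np.linalg.norm(pb) + 1e-8
            scores[(i, j)] = float(pa @ pb / denom)
    return scores

def global_similarity(scores):
    # the final verification score is the sum of local contributions,
    # which is what makes the explanation "locally additive"
    return sum(scores.values())
```

With 28×28 patches on a 112×112 input this yields a 4×4 grid of 16 region scores; inspecting individual entries of the map shows which regions drove the decision, with no post-hoc attribution step.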