Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limited interpretability and cross-domain generalization of morphing attack detection (MAD) in face recognition systems, this paper proposes an image-text multimodal framework evaluated in a zero-shot setting. It uses a pre-trained CLIP model, without fine-tuning, together with ten human-understandable textual prompts in short and long forms. Each face image is classified by matching it against the prompts, and the best-matching prompt doubles as a human-readable explanation of the decision, linking visual anomalies to natural-language descriptions. The framework is evaluated alongside state-of-the-art pre-trained neural networks on a face morphing dataset built from a publicly available face biometric dataset, covering five morphing generation techniques captured in three different media. The work positions interpretable, zero-shot multimodal reasoning as a basis for trustworthy face verification.

📝 Abstract

Morphing attack detection has become an essential component of face recognition systems for ensuring reliable verification. In this paper, we present a multimodal learning approach that can provide a textual description of the morphing attack detection decision. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can not only yield generalizable morphing attack detection but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts, including both short and long prompts. These prompts are engineered to be human-understandable text snippets. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. We present an evaluation of state-of-the-art pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different media.
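
The prompt-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the image and text encoders are omitted, the embeddings are toy vectors standing in for CLIP outputs, the prompt strings are hypothetical (they are not the paper's ten prompts), and the temperature of 100 is an assumed value. The sketch scores one image embedding against each prompt embedding by cosine similarity, applies a softmax, and returns both the best-matching prompt and the probability mass per class.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, prompt_embs, prompt_labels, logit_scale=100.0):
    """Match one image embedding against prompt embeddings.

    logit_scale is an assumed temperature, not a value taken from the paper.
    Returns (index of best prompt, per-prompt probabilities, per-label mass).
    """
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(p))) for p in prompt_embs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(probs)), key=probs.__getitem__)
    per_label = {}
    for label, p in zip(prompt_labels, probs):
        per_label[label] = per_label.get(label, 0.0) + p
    return best, probs, per_label

# Hypothetical prompts: one short and one long prompt for the morph class.
prompts = [
    "a bona fide face photo",
    "a morphed face photo",
    "a face photo blended from two identities with visible artifacts",
]
labels = ["bona fide", "morph", "morph"]

# Toy 3-dimensional embeddings standing in for encoder outputs.
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
image_emb = [0.1, 0.9, 0.05]

best, probs, per_label = zero_shot_classify(image_emb, prompt_embs, labels)
decision = max(per_label, key=per_label.get)
explanation = prompts[best]
```

Here `decision` is the predicted class and `explanation` is the prompt text that best matches the image, which is how a single similarity ranking can serve both detection and a textual description.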

Problem

Research questions and friction points this paper is trying to address.

Detect morphing attacks in face recognition systems

Provide textual descriptions for morphing attack detection

Evaluate zero-shot CLIP framework on diverse morphing techniques

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses CLIP for zero-shot morphing attack detection

Engineers human-understandable textual prompts

Evaluates five morphing generation techniques captured in three different media
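
An evaluation spanning several morphing techniques and capture media can be tallied per condition, as in the sketch below. It assumes plain accuracy as the metric, which this summary does not confirm as the paper's choice, and the technique and medium names are placeholders.

```python
from collections import defaultdict

def accuracy_by_condition(records):
    """Compute accuracy for each (technique, medium) pair.

    records: iterable of (technique, medium, true_label, predicted_label).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for technique, medium, truth, predicted in records:
        key = (technique, medium)
        totals[key] += 1
        hits[key] += int(truth == predicted)
    return {key: hits[key] / totals[key] for key in totals}

# Placeholder records; real ones would come from a detector's predictions.
toy_records = [
    ("technique_A", "medium_1", "morph", "morph"),
    ("technique_A", "medium_1", "bona fide", "morph"),
    ("technique_B", "medium_2", "morph", "morph"),
]
report = accuracy_by_condition(toy_records)
```

Reporting one score per condition, rather than a single pooled score, makes it visible when a detector holds up on one morphing technique or medium but degrades on another.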

🔎 Similar Papers

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models