In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Face recognition systems face robustness challenges against both physical presentation attacks (e.g., printed photos, masks) and digital manipulation attacks (e.g., adversarial perturbations, GAN-generated faces). Method: The paper proposes an in-context learning framework based on vision-language models (VLMs) that eliminates reliance on large-scale labelled datasets: task-specific prompting enables zero-shot or few-shot inference and thus rapid adaptation to previously unseen attack types. Contribution/Results: The authors introduce the first quantitative VLM evaluation framework tailored to security-critical scenarios, rigorously assessing cross-attack-type and cross-environment generalisation. On public benchmarks, including CASIA-SURF and Replay-Attack, the approach matches or surpasses supervised CNN-based detectors in detection accuracy while significantly improving inference efficiency and deployment flexibility, establishing a novel paradigm for resource-constrained and privacy-sensitive applications.

📝 Abstract
Recent advances in biometric systems have significantly improved the detection and prevention of fraudulent activities. However, as detection methods improve, attack techniques become increasingly sophisticated. Attacks on face recognition systems can be broadly divided into physical and digital approaches. Traditionally, deep learning models have been the primary defence against such attacks. While these models perform exceptionally well in scenarios for which they have been trained, they often struggle to adapt to different types of attacks or varying environmental conditions. These subsystems require substantial amounts of training data to achieve reliable performance, yet biometric data collection faces significant challenges, including privacy concerns and the logistical difficulties of capturing diverse attack scenarios under controlled conditions. This work investigates the application of Vision Language Models (VLM) and proposes an in-context learning framework for detecting physical presentation attacks and digital morphing attacks in biometric systems. Focusing on open-source models, the first systematic framework for the quantitative evaluation of VLMs in security-critical scenarios through in-context learning techniques is established. The experimental evaluation conducted on freely available databases demonstrates that the proposed subsystem achieves competitive performance for physical and digital attack detection, outperforming some of the traditional CNNs without resource-intensive training. The experimental results validate the proposed framework as a promising tool for improving generalisation in attack detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting physical and digital attacks on face recognition systems
Improving generalization in attack detection with Vision Language Models
Reducing reliance on resource-intensive training for biometric security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision Language Models for attack detection
Implements in-context learning framework
Evaluates VLMs in security scenarios systematically
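The in-context learning idea above can be illustrated with a minimal sketch: a few labelled example images (bona fide vs. attack) are placed in the prompt before an unlabelled probe image, so the VLM classifies the probe by analogy without any training. This is not the paper's exact prompting scheme; the message format follows common multimodal chat conventions, and `query_vlm` (not shown) would be whatever open-source VLM backend is used.

```python
# Hedged sketch of few-shot in-context prompting for attack detection.
# File names and the message schema are illustrative assumptions, not
# taken from the paper.

def build_incontext_prompt(examples, query_image_path):
    """Assemble a few-shot prompt: each in-context example pairs an
    image with its 'bona fide' or 'attack' label; the final probe
    image is left unlabelled for the VLM to classify."""
    messages = [{
        "role": "system",
        "content": ("You are a face presentation/morphing attack detector. "
                    "Answer only 'bona fide' or 'attack'."),
    }]
    question = "Is this face bona fide or an attack?"
    for image_path, label in examples:
        # Demonstration turn: image + question, followed by the answer.
        messages.append({"role": "user", "content": [
            {"type": "image", "path": image_path},
            {"type": "text", "text": question},
        ]})
        messages.append({"role": "assistant", "content": label})
    # Unlabelled probe the model must classify in-context.
    messages.append({"role": "user", "content": [
        {"type": "image", "path": query_image_path},
        {"type": "text", "text": question},
    ]})
    return messages

few_shot = [("bona_fide_01.png", "bona fide"),
            ("print_attack_01.png", "attack")]
prompt = build_incontext_prompt(few_shot, "probe.png")
print(len(prompt))  # system + 2 x (user, assistant) + final query = 6
```

Setting `examples` to an empty list yields the zero-shot variant; the framework's flexibility comes from swapping these demonstrations per attack type instead of retraining a detector.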
Lazaro Janier Gonzalez-Soler
da/sec - Biometrics and Security Research Group, Darmstadt, Germany
Maciej Salwowski
Technical University of Denmark, Denmark
Christoph Busch
Professor for Biometrics, Norwegian University of Science and Technology (NTNU)