Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in iris presentation attack detection: scarcity of samples of unknown attacks, high annotation costs, and privacy sensitivity. The authors propose a privacy-preserving approach built on multimodal large language models that, for the first time, integrates saliency knowledge provided by human experts into structured prompts to resolve ambiguity in attack categorization, without uploading raw biometric data to public cloud services. Combining an analysis of pretrained Vision Transformer embeddings with an institutionally approved Gemini 2.5 Pro deployment and a locally hosted Llama 3.2-Vision model, the method achieves state-of-the-art performance: on a dataset of 224 iris images spanning seven attack types, Gemini with expert-informed prompts outperforms both specialized CNNs and human examiners, while the locally run Llama model attains near-human accuracy.
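The summary's core mechanism, injecting human-provided salience cues into a structured prompt, can be sketched in plain Python. Everything below is illustrative: the attack names, cue wordings, and the helper `build_pad_prompt` are placeholders, not the paper's actual prompt template.

```python
# Hypothetical sketch of structured-prompt construction for iris PAD.
# Attack classes and salience descriptions are invented placeholders
# standing in for the verbal indicators collected from human subjects.

SALIENCE_CUES = {
    "textured contact lens": "a printed dot pattern overlaying the natural iris texture",
    "paper printout": "uniform specular glare and a visible print raster",
    "post-mortem iris": "corneal cloudiness and an irregular pupil boundary",
}

def build_pad_prompt(attack_types, cues):
    """Compose a structured prompt listing candidate classes with expert cues."""
    lines = [
        "You are an iris presentation-attack examiner.",
        "Classify the attached iris image as one of:",
    ]
    for name in attack_types:
        lines.append(f"- {name}: look for {cues[name]}")
    lines.append("Answer with exactly one class name.")
    return "\n".join(lines)

prompt = build_pad_prompt(list(SALIENCE_CUES), SALIENCE_CUES)
print(prompt)
```

The same string could then be sent to a locally hosted model (e.g. Llama 3.2-Vision via an on-premises endpoint) alongside the image, keeping the biometric data inside the institution.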

📝 Abstract
Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting data that is diverse enough, yet still limited in its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to the rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN) baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.
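The abstract's embedding analysis, checking which attack classes a frozen vision encoder already separates and which overlap, can be approximated with a simple between- versus within-class distance ratio. This is a hedged sketch with synthetic embeddings; a real pipeline would use the actual ViT features from the MLLM's vision tower, and `class_overlap` is an illustrative diagnostic, not the paper's method.

```python
import numpy as np

# Synthetic stand-ins for vision-encoder embeddings of three classes.
rng = np.random.default_rng(0)

def class_overlap(emb_a, emb_b):
    """Ratio of between-centroid distance to mean within-class spread.
    Values well above 1 suggest the two classes form distinct clusters;
    values near or below 1 suggest the kind of overlap where the paper
    falls back on human-salience prompts to disambiguate."""
    ca, cb = emb_a.mean(axis=0), emb_b.mean(axis=0)
    between = np.linalg.norm(ca - cb)
    within = 0.5 * (np.linalg.norm(emb_a - ca, axis=1).mean()
                    + np.linalg.norm(emb_b - cb, axis=1).mean())
    return between / within

# One well-separated pair and one overlapping pair of synthetic classes.
live    = rng.normal(0.0, 1.0, size=(50, 8))
print_a = rng.normal(6.0, 1.0, size=(50, 8))
print_b = rng.normal(6.5, 1.0, size=(50, 8))

print(class_overlap(live, print_a))     # large ratio: clearly clustered
print(class_overlap(print_a, print_b))  # small ratio: ambiguous classes
```

In this framing, pairs with a small ratio mark exactly the attack categories where the structured prompt's expert cues carry the most weight.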
Problem

Research questions and friction points this paper is trying to address.

iris presentation attack detection
biometric security
privacy constraints
multimodal LLMs
unknown attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Iris Presentation Attack Detection
Human Salience
Privacy-Constrained Biometrics
Vision Transformer Embeddings
Jacob Piland
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Byron Dowling
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Christopher Sweet
Center for Research Computing, University of Notre Dame, Notre Dame, IN 46556, USA
Adam Czajka
University of Notre Dame
Biometrics · Computer Vision · Iris Recognition · Presentation Attack Detection · Post-mortem Biometrics