Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in iris presentation attack detection: scarcity of samples of unknown attacks, high annotation costs, and privacy sensitivity. The authors propose a privacy-preserving approach built on multimodal large language models that, for the first time, integrates saliency knowledge provided by human experts into structured prompts to resolve ambiguity in attack categorization, without uploading raw biometric data to public cloud services. Combining an analysis of pretrained Vision Transformer embeddings with an institutionally approved Gemini 2.5 Pro deployment and a locally hosted Llama 3.2-Vision model, the method achieves state-of-the-art performance: on a dataset of 224 iris images spanning seven attack types, Gemini with expert-informed prompts outperforms both specialized CNNs and human examiners, while the locally run Llama model attains near-human accuracy.
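The summary's core mechanism, injecting human-provided salience cues into a structured prompt, can be sketched in plain Python. Everything below is illustrative: the attack names, cue wordings, and the helper `build_pad_prompt` are placeholders, not the paper's actual prompt template.

```python
# Hypothetical sketch of structured-prompt construction for iris PAD.
# Attack classes and salience descriptions are invented placeholders
# standing in for the verbal indicators collected from human subjects.

SALIENCE_CUES = {
    "textured contact lens": "a printed dot pattern overlaying the natural iris texture",
    "paper printout": "uniform specular glare and a visible print raster",
    "post-mortem iris": "corneal cloudiness and an irregular pupil boundary",
}

def build_pad_prompt(attack_types, cues):
    """Compose a structured prompt listing candidate classes with expert cues."""
    lines = [
        "You are an iris presentation-attack examiner.",
        "Classify the attached iris image as one of:",
    ]
    for name in attack_types:
        lines.append(f"- {name}: look for {cues[name]}")
    lines.append("Answer with exactly one class name.")
    return "\n".join(lines)

prompt = build_pad_prompt(list(SALIENCE_CUES), SALIENCE_CUES)
print(prompt)
```

The same string could then be sent to a locally hosted model (e.g. Llama 3.2-Vision via an on-premises endpoint) alongside the image, keeping the biometric data inside the institution.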

📝 Abstract
Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting data that is diverse enough, yet still limited in its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to the rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN) baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.
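The abstract's embedding analysis, checking which attack classes a frozen vision encoder already separates and which overlap, can be approximated with a simple between- versus within-class distance ratio. This is a hedged sketch with synthetic embeddings; a real pipeline would use the actual ViT features from the MLLM's vision tower, and `class_overlap` is an illustrative diagnostic, not the paper's method.

```python
import numpy as np

# Synthetic stand-ins for vision-encoder embeddings of three classes.
rng = np.random.default_rng(0)

def class_overlap(emb_a, emb_b):
    """Ratio of between-centroid distance to mean within-class spread.
    Values well above 1 suggest the two classes form distinct clusters;
    values near or below 1 suggest the kind of overlap where the paper
    falls back on human-salience prompts to disambiguate."""
    ca, cb = emb_a.mean(axis=0), emb_b.mean(axis=0)
    between = np.linalg.norm(ca - cb)
    within = 0.5 * (np.linalg.norm(emb_a - ca, axis=1).mean()
                    + np.linalg.norm(emb_b - cb, axis=1).mean())
    return between / within

# One well-separated pair and one overlapping pair of synthetic classes.
live    = rng.normal(0.0, 1.0, size=(50, 8))
print_a = rng.normal(6.0, 1.0, size=(50, 8))
print_b = rng.normal(6.5, 1.0, size=(50, 8))

print(class_overlap(live, print_a))     # large ratio: clearly clustered
print(class_overlap(print_a, print_b))  # small ratio: ambiguous classes
```

In this framing, pairs with a small ratio mark exactly the attack categories where the structured prompt's expert cues carry the most weight.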
Problem

Research questions and friction points this paper is trying to address.

iris presentation attack detection
biometric security
privacy constraints
multimodal LLMs
unknown attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Iris Presentation Attack Detection
Human Salience
Privacy-Constrained Biometrics
Vision Transformer Embeddings
Jacob Piland
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Byron Dowling
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Christopher Sweet
Center for Research Computing, University of Notre Dame, Notre Dame, IN 46556, USA
Adam Czajka
University of Notre Dame
Biometrics · Computer Vision · Iris Recognition · Presentation Attack Detection · Post-mortem Biometrics