Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models

2025-01-03
AI Summary
Existing face anti-spoofing (FAS) methods suffer from poor generalization and limited interpretability. This paper pioneers a paradigm shift by formulating FAS as a visual question answering (VQA) task, leveraging multimodal large language models (MLLMs) to jointly perform discrimination and explanation. We introduce three key innovations: (1) a spoof-aware captioning and filtering strategy that generates noise-robust, semantically precise spoof descriptions; (2) a lopsided language model loss that decouples optimization objectives for binary spoof classification and natural-language explanation generation; and (3) a globally aware connector that enhances cross-level, global vision-language alignment. Evaluated on a rigorous one-to-eleven cross-domain benchmark spanning 12 diverse datasets, our method substantially outperforms state-of-the-art approaches, achieving significant gains in both out-of-distribution generalization and explanation plausibility.

๐Ÿ“ Abstract
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
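The abstract describes the L-LM loss only at a high level: the loss is computed separately for the judgment and the interpretation, and the judgment is prioritized. The sketch below illustrates one way such a split could work and is not the paper's actual formula. The function names, the `interp_weight` parameter, and the choice of averaging each part separately are all assumptions made for illustration.

```python
import math


def token_nll(probs, target):
    """Negative log-likelihood of the target token under one probability vector."""
    return -math.log(probs[target])


def lopsided_lm_loss(token_probs, targets, is_judgment, interp_weight=0.5):
    """Hypothetical 'lopsided' loss: judgment and interpretation tokens are
    averaged separately, and the interpretation term is down-weighted.

    token_probs  : list of per-token probability vectors predicted by the model
    targets      : list of ground-truth token ids
    is_judgment  : list of bools, True for tokens of the real/spoof answer
    interp_weight: assumed weight (< 1) for the possibly noisy caption tokens
    """
    judgment, interpretation = [], []
    for probs, target, flag in zip(token_probs, targets, is_judgment):
        (judgment if flag else interpretation).append(token_nll(probs, target))

    # Averaging each group on its own keeps a short judgment from being
    # diluted by a much longer interpretation.
    j_loss = sum(judgment) / len(judgment) if judgment else 0.0
    i_loss = sum(interpretation) / len(interpretation) if interpretation else 0.0
    return j_loss + interp_weight * i_loss


if __name__ == "__main__":
    # Toy example: one judgment token followed by one interpretation token.
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(lopsided_lm_loss(probs, targets=[0, 0], is_judgment=[True, False]))
```

In a plain language-model loss averaged over all tokens, a one- or two-token judgment would contribute little next to a long caption. Separating the two terms, as assumed here, is one simple way to keep the judgment dominant while still learning from the interpretation.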
Problem

Research questions and friction points this paper is trying to address.

Facial Recognition
Unseen Spoof Types
Explainability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable Facial Anti-Spoofing
Large Model Integration
Descriptive Learning Strategy
Guosheng Zhang
Department of Computer Vision Technology (VIS), Baidu Inc
Keyao Wang
Baidu Inc.
deep learning, face anti-spoofing, computer vision
Haixiao Yue
Baidu Inc.
Face Recognition, Face Anti-Spoofing, Object Detection
Ajian Liu
CBSR&MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA)
Gang Zhang
Tsinghua University
computer vision
Kun Yao
Department of Computer Vision Technology (VIS), Baidu Inc
Errui Ding
Baidu Inc.
computer vision, machine learning
Jingdong Wang
Department of Computer Vision Technology (VIS), Baidu Inc