Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

📅 2026-04-21

🤖 AI Summary
This work addresses three limitations of existing face anti-spoofing (FAS) methods: poor generalization to unseen domains, the high computational cost of approaches built on vision-language models, and a dependence on the quality of the underlying visual features. The authors systematically evaluate 15 pretrained vision-only foundation models on cross-domain FAS tasks and build an efficient, robust vision-only baseline that combines FAS-Aug, Patch-wise Data Augmentation (PDA), and an Attention-weighted Patch Loss (APL). Their analysis shows that self-supervised Vision Transformers, particularly DINOv2 with Registers, suppress attention artifacts and capture fine-grained spoofing cues. The proposed baseline achieves state-of-the-art performance under the MICO protocol and outperforms existing methods under the data-constrained Limited Source Domains (LSD) protocol while maintaining superior computational efficiency.

📝 Abstract
Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We systematically benchmark 15 pre-trained models, spanning supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA), and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. It also outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/
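
The abstract names an Attention-weighted Patch Loss (APL) but does not define it here. Purely as an illustration, the sketch below assumes one plausible reading: each image patch gets its own live/spoof logit, a per-patch binary cross-entropy is computed, and the patch losses are averaged using normalized attention scores as weights. The function name, its arguments, and the weighting scheme are all assumptions made for this example and may differ from the authors' actual formulation.

```python
import math


def attention_weighted_patch_loss(patch_logits, attention, label):
    """Hypothetical sketch of an attention-weighted patch loss.

    patch_logits: per-patch spoof logits (one float per patch).
    attention:    non-negative attention score per patch (assumed to come
                  from the backbone, e.g. CLS-to-patch attention).
    label:        1 for spoof, 0 for live (an assumed convention).
    """
    if len(patch_logits) != len(attention):
        raise ValueError("patch_logits and attention must have equal length")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention scores must sum to a positive value")

    # Normalize attention so the weights sum to 1.
    weights = [a / total for a in attention]

    loss = 0.0
    for z, w in zip(patch_logits, weights):
        # Numerically stable binary cross-entropy with logits:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        bce = max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))
        loss += w * bce
    return loss


if __name__ == "__main__":
    logits = [4.0, -4.0]  # first patch predicts spoof, second predicts live
    print(attention_weighted_patch_loss(logits, [0.9, 0.1], label=1))
    print(attention_weighted_patch_loss(logits, [1.0, 1.0], label=1))
```

Under this assumed weighting, patches that receive more attention contribute more to the loss, so a confidently wrong patch with low attention is penalized less than it would be under a uniform average.
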
Problem

Research questions and friction points this paper is trying to address.

Face Anti-Spoofing
Domain Generalization
Vision Foundation Models
Cross-domain
Self-supervised Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised Vision Transformers
Domain Generalization
Face Anti-Spoofing
Vision-only Foundation Models
Attention-weighted Patch Loss