🤖 AI Summary
To address the limited generalization of face anti-spoofing methods against unknown spoofing attacks, this paper proposes the first zero-shot defense framework that requires no spoofed samples. The method uses only bona fide facial images from a single source domain and implicitly models the semantic distribution of unseen attacks via the CLIP vision-language model. Through differentiable text prompt optimization that combines a relaxed prior constraint with semantic independence regularization, it learns diverse spoof prompts that are semantically distant from genuine faces and semantically independent of one another, enabling open-set modeling of previously unseen attacks. This work pioneers the integration of prompt learning into zero-shot generalization for face anti-spoofing. Evaluated on nine datasets in cross-domain settings, the method achieves state-of-the-art performance, significantly enhancing zero-shot robustness against unknown attack types without relying on any spoofed training data.
📝 Abstract
Face anti-spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision-language models, thereby enhancing the model's ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision-language models, enabling state-of-the-art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.
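The three ingredients named in the abstract (distance from real face images, a relaxed prior constraint, and semantic independence among spoof prompts) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the hinge-style margin used for the "relaxed" prior, the squared off-diagonal cosine penalty used for independence, the loss weights, and the softmax scoring rule are all assumptions made for illustration. The inputs stand in for embeddings that a vision-language model would produce.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def spoof_prompt_loss(real_img, spoof_txt, prior_txt,
                      margin=0.3, w_prior=1.0, w_indep=1.0):
    """Hypothetical objective for learning unknown spoof prompts.

    real_img:  (N, D) embeddings of real face images
    spoof_txt: (K, D) embeddings of the learnable spoof prompts
    prior_txt: (D,)   embedding of a generic prior prompt (assumed anchor)
    """
    real_img = l2_normalize(real_img)
    spoof_txt = l2_normalize(spoof_txt)
    prior_txt = l2_normalize(prior_txt)

    # 1) Separation: minimizing mean cosine similarity pushes the spoof
    #    prompts away from real face images.
    separation = float(np.mean(real_img @ spoof_txt.T))

    # 2) Relaxed prior (assumed hinge form): a prompt is penalized only
    #    once its cosine distance to the prior exceeds `margin`.
    prior_dist = 1.0 - spoof_txt @ prior_txt
    prior = float(np.mean(np.maximum(0.0, prior_dist - margin)))

    # 3) Independence (assumed form): mean squared cosine similarity
    #    between distinct spoof prompts.
    gram = spoof_txt @ spoof_txt.T
    off_diag = gram - np.diag(np.diag(gram))
    k = spoof_txt.shape[0]
    independence = float(np.sum(off_diag ** 2) / max(k * (k - 1), 1))

    total = separation + w_prior * prior + w_indep * independence
    return {"separation": separation, "prior": prior,
            "independence": independence, "total": total}


def spoof_score(img, real_txt, spoof_txt, temperature=0.01):
    """Zero-shot score in [0, 1]: one minus the softmax mass on the real prompt."""
    img = l2_normalize(img)
    prompts = l2_normalize(np.vstack([real_txt[None, :], spoof_txt]))
    logits = prompts @ img / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(1.0 - probs[0])
```

In a real training loop the spoof prompt embeddings would come from a text encoder applied to learnable prompt tokens, and the loss would be minimized with an autodiff framework. The sketch only shows how the three terms pull in different directions: the separation term drives prompts away from real faces, the prior term keeps them within a loose neighborhood of the prior, and the independence term stops them from collapsing onto a single spoof pattern.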