Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLIP-based face anti-spoofing (FAS) methods suffer from weak cross-domain generalization due to their reliance on a single text prompt per class (e.g., "live"/"fake") and underutilization of patch-level visual cues. To address this, we propose MVP-FAS, a framework with two key components: (1) a Multi-View Slot attention (MVS) mechanism that adaptively focuses on fine-grained forgery patterns within CLIP's image patch embeddings; and (2) a Multi-Text Patch Alignment (MTPA) module that constructs diverse semantic prompts via synonym-based textual paraphrasing, thereby strengthening cross-modal text-image alignment. By jointly modeling local forged textures and global semantic consistency, MVP-FAS achieves state-of-the-art performance across multiple cross-domain FAS benchmarks, reducing average ACER by 12.3% over prior methods. Extensive experiments validate its superior generalization and high-accuracy spoof detection.

📝 Abstract
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
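The MTPA module described in the abstract aligns CLIP patch embeddings with several paraphrased text prompts per class instead of a single prompt. A minimal sketch of that idea, assuming cosine similarity between L2-normalized embeddings and a simple mean over patches and paraphrases (function name, shapes, and the random toy embeddings are illustrative, not the paper's implementation):

```python
import numpy as np

def multi_text_patch_alignment(patch_emb, text_embs_per_class):
    """Score each class by aligning every patch embedding with several
    paraphrased text embeddings, then averaging (hypothetical sketch)."""
    # L2-normalize so dot products become cosine similarities.
    patch = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    scores = []
    for texts in text_embs_per_class:  # one set of paraphrases per class
        t = texts / np.linalg.norm(texts, axis=-1, keepdims=True)
        sim = patch @ t.T              # (num_patches, num_paraphrases)
        scores.append(sim.mean())      # average over patches and paraphrases
    return np.array(scores)

# Toy example: 4 patch embeddings, 2 classes ('live', 'spoof'),
# each class described by 3 paraphrased prompts, embedding dim 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
live_texts = rng.normal(size=(3, 8))
spoof_texts = rng.normal(size=(3, 8))
scores = multi_text_patch_alignment(patches, [live_texts, spoof_texts])
print(scores.shape)  # (2,)
```

Averaging over multiple paraphrases is what reduces dependence on any single domain-specific prompt: an unusual wording of one paraphrase is smoothed out by the others.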
Problem

Research questions and friction points this paper is trying to address.

Exploiting CLIP's patch tokens for spoof detection
Overcoming single text prompt generalization limits
Enhancing cross-domain face anti-spoofing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-View Slot attention extracts spatial features
Multi-Text Patch Alignment enhances semantic robustness
Paraphrased texts reduce domain-specific dependence
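The MVS module builds on slot attention, where a small set of slots competes for image patches. A minimal single-step sketch of that competition, assuming text-initialized slots and dot-product attention (the initialization, shapes, and lack of learned projections are simplifying assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, patches):
    """One slot-attention update: slots compete for patches via a
    softmax over the slot axis, then aggregate a weighted mean."""
    attn = softmax(slots @ patches.T, axis=0)      # (num_slots, num_patches)
    attn = attn / attn.sum(axis=1, keepdims=True)  # normalize weights per slot
    return attn @ patches                          # updated slot representations

slots = np.random.default_rng(1).normal(size=(2, 8))    # e.g., text-initialized
patches = np.random.default_rng(2).normal(size=(4, 8))  # CLIP patch embeddings
new_slots = slot_attention_step(slots, patches)
print(new_slots.shape)  # (2, 8)
```

The softmax over slots (rather than over patches) is the defining design choice of slot attention: each patch's attention mass is divided among the slots, so slots specialize on different regions, which matches the paper's goal of capturing both local spoof textures and global context.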
Jeongmin Yu
Yonsei University
Susang Kim
Yonsei University, POSCO DX
Kisu Lee
Yonsei University
Taekyoung Kwon
Yonsei University
Won-Yong Shin
Professor, CSE at Yonsei University
data mining, machine learning, information theory, mobile computing, wireless networking
Ha Young Kim
Yonsei University