Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLIP-based face anti-spoofing (FAS) methods suffer from weak cross-domain generalization due to their reliance on a single text prompt per class (e.g., "live"/"fake") and underutilization of patch-level visual cues. To address this, we propose MVP-FAS, a framework with two key components: (1) a Multi-View Slot attention (MVS) mechanism that adaptively focuses on fine-grained forgery patterns within CLIP's image patch embeddings; and (2) a Multi-Text Patch Alignment (MTPA) module that constructs diverse semantic prompts via synonym-based textual paraphrasing, thereby strengthening cross-modal text-image alignment. By jointly modeling local forged textures and global semantic consistency, MVP-FAS achieves state-of-the-art performance across multiple cross-domain FAS benchmarks, reducing average ACER by 12.3% over prior methods. Extensive experiments validate its superior generalization and high-accuracy spoof detection.

📝 Abstract
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
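The MTPA module described in the abstract aligns CLIP patch embeddings with several paraphrased text prompts per class instead of a single prompt. A minimal sketch of that idea, assuming cosine similarity between L2-normalized embeddings and a simple mean over patches and paraphrases (function name, shapes, and the random toy embeddings are illustrative, not the paper's implementation):

```python
import numpy as np

def multi_text_patch_alignment(patch_emb, text_embs_per_class):
    """Score each class by aligning every patch embedding with several
    paraphrased text embeddings, then averaging (hypothetical sketch)."""
    # L2-normalize so dot products become cosine similarities.
    patch = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    scores = []
    for texts in text_embs_per_class:  # one set of paraphrases per class
        t = texts / np.linalg.norm(texts, axis=-1, keepdims=True)
        sim = patch @ t.T              # (num_patches, num_paraphrases)
        scores.append(sim.mean())      # average over patches and paraphrases
    return np.array(scores)

# Toy example: 4 patch embeddings, 2 classes ('live', 'spoof'),
# each class described by 3 paraphrased prompts, embedding dim 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
live_texts = rng.normal(size=(3, 8))
spoof_texts = rng.normal(size=(3, 8))
scores = multi_text_patch_alignment(patches, [live_texts, spoof_texts])
print(scores.shape)  # (2,)
```

Averaging over multiple paraphrases is what reduces dependence on any single domain-specific prompt: an unusual wording of one paraphrase is smoothed out by the others.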
Problem

Research questions and friction points this paper is trying to address.

Exploiting CLIP's patch tokens for spoof detection
Overcoming single text prompt generalization limits
Enhancing cross-domain face anti-spoofing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-View Slot attention extracts spatial features
Multi-Text Patch Alignment enhances semantic robustness
Paraphrased texts reduce domain-specific dependence
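The MVS module builds on slot attention, where a small set of slots competes for image patches. A minimal single-step sketch of that competition, assuming text-initialized slots and dot-product attention (the initialization, shapes, and lack of learned projections are simplifying assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, patches):
    """One slot-attention update: slots compete for patches via a
    softmax over the slot axis, then aggregate a weighted mean."""
    attn = softmax(slots @ patches.T, axis=0)      # (num_slots, num_patches)
    attn = attn / attn.sum(axis=1, keepdims=True)  # normalize weights per slot
    return attn @ patches                          # updated slot representations

slots = np.random.default_rng(1).normal(size=(2, 8))    # e.g., text-initialized
patches = np.random.default_rng(2).normal(size=(4, 8))  # CLIP patch embeddings
new_slots = slot_attention_step(slots, patches)
print(new_slots.shape)  # (2, 8)
```

The softmax over slots (rather than over patches) is the defining design choice of slot attention: each patch's attention mass is divided among the slots, so slots specialize on different regions, which matches the paper's goal of capturing both local spoof textures and global context.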
Jeongmin Yu
Yonsei University
Susang Kim
Yonsei University, POSCO DX
Kisu Lee
Yonsei University
Taekyoung Kwon
Yonsei University
Won-Yong Shin
Professor, CSE at Yonsei University
data mining, machine learning, information theory, mobile computing, wireless networking
Ha Young Kim
Yonsei University