🤖 AI Summary
To address weak cross-domain generalization in multimodal face presentation attack detection (PAD) caused by modality bias and domain shift, this paper proposes the MMDA framework. Methodologically, MMDA introduces: (1) MD2A, a Modality-Domain Joint Differential Attention mechanism that explicitly models and suppresses both biases; (2) RS2 soft alignment and U-DSA dual-space adaptation, enabling fine-grained feature alignment and cross-domain transfer atop CLIP's zero-shot representations; and (3) integrated multimodal denoising and cross-domain alignment to improve robustness on unseen domains. Evaluated on four benchmark datasets across diverse cross-domain protocols, MMDA consistently outperforms state-of-the-art methods, achieving average AUC gains of 3.2–5.7 percentage points and demonstrating joint improvements in generalization capability and detection accuracy.
📝 Abstract
Face Anti-Spoofing (FAS) is essential for securing facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle to generalize, mainly due to modality-specific biases and domain shifts. To address these challenges, we introduce the **M**ulti**m**odal **D**enoising and **A**lignment (**MMDA**) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross-modal alignment. The **M**odality-**D**omain Joint **D**ifferential **A**ttention (**MD2A**) module in MMDA concurrently mitigates the impact of domain and modality noise by refining the attention mechanism based on extracted common noise features. Furthermore, the **R**epresentation **S**pace **S**oft (**RS2**) Alignment strategy utilizes the pre-trained CLIP model to flexibly align multi-domain multimodal data into a generalized representation space, preserving intricate representations and enhancing the model's adaptability to various unseen conditions. We also design a **U**-shaped **D**ual **S**pace **A**daptation (**U-DSA**) module to enhance the adaptability of representations while maintaining generalization performance. Together, these components strengthen both the framework's generalization and its capacity to capture complex representations. Our experimental results on four benchmark datasets under different evaluation protocols demonstrate that the MMDA framework outperforms existing state-of-the-art methods in terms of cross-domain generalization and multimodal detection accuracy. The code will be released soon.
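The core idea behind MD2A — estimating a common noise component shared across modalities and suppressing it before attention — can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's actual MD2A module: the function name `md2a_attention`, the mean-based noise estimate, and the fixed suppression weight of 0.5 are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def md2a_attention(feats, noise_weight=0.5):
    """Hypothetical sketch of difference-attention-style denoising.

    feats: dict mapping modality name (e.g. "rgb", "depth") to a
           (tokens, dim) feature array.
    The shared component is estimated as the cross-modality mean; each
    modality's deviation from it is treated as modality-specific noise
    and partially subtracted (noise_weight is a made-up hyperparameter)
    before standard scaled dot-product self-attention.
    """
    mean_feat = np.mean(list(feats.values()), axis=0)  # cross-modality common part
    out = {}
    for name, f in feats.items():
        deviation = f - mean_feat            # modality-specific "noise" estimate
        denoised = f - noise_weight * deviation
        d = denoised.shape[-1]
        attn = softmax(denoised @ denoised.T / np.sqrt(d))  # (tokens, tokens)
        out[name] = attn @ denoised          # attention-weighted denoised features
    return out
```

A real implementation would learn the noise estimator and fold the suppression into the attention logits of a transformer block; the sketch only shows the denoise-then-attend ordering the abstract describes.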