🤖 AI Summary
Multi-modal face anti-spoofing (FAS) faces two key challenges: significant cross-domain distribution shifts and missing modalities during inference. To address these, we propose the Cross-modal Transition-guided Network (CTNet), the first method to explicitly model the consistent cross-modal feature transition patterns among live samples, thereby constructing a unified feature space. CTNet detects out-of-distribution attacks via transition inconsistency and reconstructs the infrared (IR) and depth modalities under RGB guidance. Our approach jointly integrates cross-modal feature transition learning, consistency regularization, and complementary modality generation within a unified RGB/IR/depth training and inference framework. This design substantially improves cross-domain generalization and robustness to missing modalities. Extensive experiments demonstrate that CTNet consistently outperforms state-of-the-art two-class multi-modal FAS methods across mainstream benchmarks and evaluation protocols.
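The RGB-guided reconstruction of a missing modality can be illustrated with a minimal sketch. This is not the paper's actual generator; it assumes hypothetical paired RGB/IR feature vectors and stands in a simple least-squares linear map for the learned complementary-modality generation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (all hypothetical): paired RGB and IR feature vectors from training
# faces. At test time the IR modality is missing and must be generated from RGB.
f_rgb = rng.normal(size=(200, 16))
true_map = rng.normal(size=(16, 16))          # unknown RGB->IR relation
f_ir = f_rgb @ true_map + 0.01 * rng.normal(size=(200, 16))

# Fit a linear generator by least squares: a crude stand-in for the paper's
# learned RGB-guided generation of the auxiliary modality.
W, *_ = np.linalg.lstsq(f_rgb, f_ir, rcond=None)

def generate_ir(f_rgb_test):
    """Reconstruct missing IR features from the available RGB features."""
    return f_rgb_test @ W

f_rgb_test = rng.normal(size=(5, 16))
f_ir_hat = generate_ir(f_rgb_test)

# Relative reconstruction error against the (toy) ground-truth IR features.
f_ir_true = f_rgb_test @ true_map
rel_err = np.linalg.norm(f_ir_hat - f_ir_true) / np.linalg.norm(f_ir_true)
```

With near-noiseless toy data the linear fit recovers the mapping closely, so the reconstructed IR features can substitute for the missing modality downstream.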
📝 Abstract
Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by different camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies between training and testing domains than single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle these challenges in the multi-modal FAS task. Our motivation stems from the observation that, within a single modality, the visual differences among live faces are typically much smaller than those among spoof faces. Additionally, feature transitions across modalities are more consistent within the live class than between the live and spoof classes. Building on this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we exploit the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.
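The transition-consistency idea above can be sketched in a toy NumPy example. All names and the feature model are hypothetical, and a per-sample feature difference plus a distance to the live-class mean stand in for the learned transition pattern and the inconsistency-based OOD score:

```python
import numpy as np

def transition(f_rgb, f_ir):
    """Cross-modal transition: per-sample difference between modality features."""
    return f_ir - f_rgb

def live_transition_mean(f_rgb_live, f_ir_live):
    """Mean transition over live samples: a simple stand-in for the consistent
    live-class transition pattern described in the abstract."""
    return transition(f_rgb_live, f_ir_live).mean(axis=0)

def inconsistency_score(f_rgb, f_ir, t_live):
    """Deviation of a test sample's transition from the live-class pattern.
    Large values flag out-of-distribution (spoof) samples."""
    return np.linalg.norm(transition(f_rgb, f_ir) - t_live, axis=-1)

rng = np.random.default_rng(0)

# Toy features: live samples share a fixed RGB->IR shift (+1.0) plus small
# noise; a spoof sample exhibits a different, inconsistent shift.
f_rgb_live = rng.normal(size=(100, 8))
f_ir_live = f_rgb_live + 1.0 + 0.05 * rng.normal(size=(100, 8))
t_live = live_transition_mean(f_rgb_live, f_ir_live)

f_rgb_test = rng.normal(size=(1, 8))
live_score = inconsistency_score(f_rgb_test, f_rgb_test + 1.0, t_live)
spoof_score = inconsistency_score(f_rgb_test, f_rgb_test - 2.0, t_live)
```

In this toy setup the live test sample scores near zero while the spoof sample scores high, mirroring how transition inconsistency separates the two classes without ever training on the spoof pattern.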