🤖 AI Summary
Multi-modal face anti-spoofing (FAS) faces two key challenges: significant cross-domain distribution shifts and missing modalities during inference. To address these, we propose the Cross-modal Transition-guided Network (CTNet), the first method to explicitly model the consistent cross-modal feature transition patterns among live samples, thereby constructing a unified feature space. CTNet detects out-of-distribution attacks via transition inconsistency and reconstructs the infrared (IR) and depth modalities under RGB guidance. Our approach jointly integrates cross-modal feature transition learning, consistency regularization, and complementary modality generation within a unified RGB/IR/depth training and inference framework. This design substantially improves cross-domain generalization and robustness to missing modalities. Extensive experiments demonstrate that CTNet consistently outperforms state-of-the-art two-class multi-modal FAS methods across mainstream benchmarks and evaluation protocols.
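The RGB-guided reconstruction of a missing modality can be illustrated with a minimal sketch. This is not the paper's actual generator; it assumes hypothetical paired RGB/IR feature vectors and stands in a simple least-squares linear map for the learned complementary-modality generation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (all hypothetical): paired RGB and IR feature vectors from training
# faces. At test time the IR modality is missing and must be generated from RGB.
f_rgb = rng.normal(size=(200, 16))
true_map = rng.normal(size=(16, 16))          # unknown RGB->IR relation
f_ir = f_rgb @ true_map + 0.01 * rng.normal(size=(200, 16))

# Fit a linear generator by least squares: a crude stand-in for the paper's
# learned RGB-guided generation of the auxiliary modality.
W, *_ = np.linalg.lstsq(f_rgb, f_ir, rcond=None)

def generate_ir(f_rgb_test):
    """Reconstruct missing IR features from the available RGB features."""
    return f_rgb_test @ W

f_rgb_test = rng.normal(size=(5, 16))
f_ir_hat = generate_ir(f_rgb_test)

# Relative reconstruction error against the (toy) ground-truth IR features.
f_ir_true = f_rgb_test @ true_map
rel_err = np.linalg.norm(f_ir_hat - f_ir_true) / np.linalg.norm(f_ir_true)
```

With near-noiseless toy data the linear fit recovers the mapping closely, so the reconstructed IR features can substitute for the missing modality downstream.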
📝 Abstract
Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by different camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies between training and testing domains than single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle these challenges in the multi-modal FAS task. Our motivation stems from the observation that, within a single modality, the visual differences among live faces are typically much smaller than those among spoof faces. Additionally, feature transitions across modalities are more consistent within the live class than between the live and spoof classes. Building on this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we exploit the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.
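The transition-consistency idea above can be sketched in a toy NumPy example. All names and the feature model are hypothetical, and a per-sample feature difference plus a distance to the live-class mean stand in for the learned transition pattern and the inconsistency-based OOD score:

```python
import numpy as np

def transition(f_rgb, f_ir):
    """Cross-modal transition: per-sample difference between modality features."""
    return f_ir - f_rgb

def live_transition_mean(f_rgb_live, f_ir_live):
    """Mean transition over live samples: a simple stand-in for the consistent
    live-class transition pattern described in the abstract."""
    return transition(f_rgb_live, f_ir_live).mean(axis=0)

def inconsistency_score(f_rgb, f_ir, t_live):
    """Deviation of a test sample's transition from the live-class pattern.
    Large values flag out-of-distribution (spoof) samples."""
    return np.linalg.norm(transition(f_rgb, f_ir) - t_live, axis=-1)

rng = np.random.default_rng(0)

# Toy features: live samples share a fixed RGB->IR shift (+1.0) plus small
# noise; a spoof sample exhibits a different, inconsistent shift.
f_rgb_live = rng.normal(size=(100, 8))
f_ir_live = f_rgb_live + 1.0 + 0.05 * rng.normal(size=(100, 8))
t_live = live_transition_mean(f_rgb_live, f_ir_live)

f_rgb_test = rng.normal(size=(1, 8))
live_score = inconsistency_score(f_rgb_test, f_rgb_test + 1.0, t_live)
spoof_score = inconsistency_score(f_rgb_test, f_rgb_test - 2.0, t_live)
```

In this toy setup the live test sample scores near zero while the spoof sample scores high, mirroring how transition inconsistency separates the two classes without ever training on the spoof pattern.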