Learning Representation and Synergy Invariances: A Provable Framework for Generalized Multimodal Face Anti-Spoofing

πŸ“… 2025-11-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Cross-domain multimodal face anti-spoofing (FAS) suffers from two invariance risks: (1) instability of modality representations under domain shift, and (2) over-reliance on domain-specific modality co-occurrence patterns, which impairs generalization to unseen attacks. To address these, we propose RiSe, a provably generalizable framework. First, we theoretically characterize how the class asymmetry inherent in FAS (diverse spoofs vs. compact reals) enlarges the generalization error bound. Second, we design AsyIRM to learn a radially invariant decision boundary, mitigating asymmetry-induced bias. Third, we integrate Multimodal Synergy Disentanglement (MMSD), a self-supervised task that disentangles domain-specific modality co-occurrence via spherical representation learning and cross-sample mixing, strengthening modality decoupling. Extensive experiments on multiple cross-domain multimodal FAS benchmarks show that RiSe consistently outperforms state-of-the-art methods, empirically supporting the alignment between the theoretical generalization bound and practical robustness.

πŸ“ Abstract
Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
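The radial/angular split behind AsyIRM can be illustrated with a minimal sketch: classify by the embedding's radius alone, leaving the angular direction free to carry domain cues. The function name, threshold `r0`, and toy data below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def radial_decision(z, r0=1.0):
    """Score embeddings by radius alone: positive => spoof, negative => real.

    The angular component z / ||z|| is deliberately ignored, so it remains
    free to encode domain-specific cues, mirroring the radial (invariant
    boundary) vs. angular (domain cue) split described for AsyIRM.
    r0 is an illustrative decision radius.
    """
    r = np.linalg.norm(z, axis=-1)
    return r - r0

# Toy data reflecting the class asymmetry in FAS: a compact "real"
# cluster near the origin vs. diverse "spoof" points at larger radii.
reals = np.array([[0.2, 0.1], [-0.3, 0.2]])
spoofs = np.array([[2.0, 0.0], [0.0, -3.0]])

print(radial_decision(reals))   # both negative -> classified real
print(radial_decision(spoofs))  # both positive -> classified spoof
```

Because only the norm enters the decision, any rotation of the feature space (e.g. a domain-dependent change of direction) leaves the boundary unchanged, which is the sense in which the spherical boundary is invariant.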
Problem

Research questions and friction points this paper is trying to address.

Addressing modal representation invariant risk in asymmetric FAS distributions
Mitigating modal synergy invariant risk from spurious inter-modal correlations
Improving cross-domain generalization for multimodal face anti-spoofing systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Invariant Risk Minimization for representation risk
Multimodal Synergy Disentanglement for synergy risk
Invariant spherical decision boundary in radial space
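The cross-sample mixing step of MMSD can be sketched as follows: permuting one modality's features across the batch breaks any domain-specific co-occurrence between modalities, so a model trained on the mixed batch cannot rely on spurious inter-modal synergy. Function and variable names here are illustrative assumptions, not from the paper.

```python
import numpy as np

def cross_sample_mix(feats, rng):
    """Break modality co-occurrence by permuting one modality across the batch.

    feats: dict mapping modality name -> (B, D) feature array.
    Returns the mixed dict, the chosen modality, and the permutation used,
    so a disentanglement objective can still recover the pairing.
    """
    mods = sorted(feats)
    m = mods[rng.integers(len(mods))]           # pick one modality at random
    batch = next(iter(feats.values())).shape[0]
    perm = rng.permutation(batch)               # shuffle it across samples
    mixed = dict(feats)
    mixed[m] = feats[m][perm]                   # other modalities untouched
    return mixed, m, perm

rng = np.random.default_rng(0)
feats = {"rgb": rng.normal(size=(4, 8)), "depth": rng.normal(size=(4, 8))}
mixed, m, perm = cross_sample_mix(feats, rng)
```

In the full method this mixing would feed a self-supervised disentanglement objective; the sketch only shows how the spurious modality pairing is destroyed while each modality's own features are preserved.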