LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI-generated face detection methods struggle to model structural consistency across diverse generative paradigms, resulting in poor generalization. To address this, the paper proposes a robust detection framework featuring: (1) a facial-landmark-guided Region-Guided Multi-Head Attention (RG-MHA) mechanism; (2) a novel Layer-aware Mask Modulation (LAMM) module, which dynamically modulates regional attention along the ViT depth dimension; and (3) context-aware parameter generation combined with a dynamic gating mechanism to enhance cross-model generalization. Evaluated across multiple generative models, including both GANs and diffusion models, the framework achieves an average accuracy of 94.09% (+5.45% over SOTA) and a mean average precision (AP) of 98.62% (+3.09% over SOTA), demonstrating significantly improved robustness against diverse synthetic face forgeries.

📝 Abstract
Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
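The abstract describes two interacting mechanisms: regional attention masks derived from facial landmarks (RG-MHA), and layer-specific mask weights and gating values that modulate them (LAMM). A minimal pure-Python sketch of how such modulation could work on one attention head, assuming an additive mask bias on the logits and a convex gate blend between modulated and unmodulated attention; the function names and exact formulation are illustrative assumptions, not the authors' implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def region_guided_attention(scores, region_mask, mask_weight, gate):
    """Sketch of region-guided attention with layer-aware modulation.

    scores:      raw query-key attention logits for one head (n x n)
    region_mask: 1.0 where both tokens fall in the same landmark-defined
                 facial region (e.g. eyes, mouth), 0.0 elsewhere
    mask_weight: layer-specific strength of the regional bias
                 (in LAMM-ViT, generated per layer from network context)
    gate:        layer-specific gating value in [0, 1] blending the
                 region-modulated attention with the plain attention
    """
    n = len(scores)
    out = []
    for i in range(n):
        # Additive regional bias on the logits, scaled per layer.
        biased = [scores[i][j] + mask_weight * region_mask[i][j]
                  for j in range(n)]
        modulated = softmax(biased)
        plain = softmax(scores[i])
        # Gate controls how much regional focus this layer applies.
        out.append([gate * m + (1 - gate) * p
                    for m, p in zip(modulated, plain)])
    return out
```

With `gate = 0` a layer falls back to ordinary attention, while `gate = 1` applies the full regional bias, so different depths can emphasize regional consistency to different degrees, which is the intuition behind the hierarchical modulation described above.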
Problem

Research questions and friction points this paper is trying to address.

Detecting AI-synthetic faces with diverse generation techniques
Addressing inconsistencies in facial regions across generative models
Improving generalization in facial forgery detection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-aware Mask Modulation for dynamic adaptation
Region-Guided Multi-Head Attention for facial inconsistencies
Vision Transformer with hierarchical forgery detection
Jiangling Zhang
Wuhan University of Technology
Weijie Zhu
Wuhan University of Technology
Jirui Huang
Wuhan University of Technology
Yaxiong Chen
Wuhan University of Technology
deep hashing, deep learning