🤖 AI Summary
This paper targets three limitations of Vision Transformer (ViT)-based face forgery detection: the prohibitive computational cost of full fine-tuning, weak modeling of local forensic cues, and narrow coverage of forgery artifacts. It proposes a parameter-efficient and generalizable Mixture-of-Experts (MoE) architecture. The method freezes the pre-trained ViT backbone and fine-tunes only lightweight LoRA and Adapter modules; integrates global Transformer representations with local CNN priors; and introduces a dynamic routing mechanism that selects among experts, enabling multi-granularity forgery pattern modeling and lightweight expert-wise adaptation. The contribution is the first application of MoE to face forgery detection, with plug-and-play transfer across diverse ViT variants. Experiments demonstrate state-of-the-art performance on multiple benchmarks, with over 90% reduction in trainable parameters, significantly improved cross-dataset generalization, and enhanced computational efficiency.
📝 Abstract
Deepfakes have recently raised significant trust issues and security concerns among the public. Compared to CNN-based face forgery detectors, ViT-based methods exploit the expressivity of transformers and achieve superior detection performance. However, these approaches still exhibit the following limitations: (1) fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2) ViT-based methods struggle to capture local forgery clues, leading to model bias; (3) these methods restrict their scope to only one or a few face forgery features, resulting in limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD updates only lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD combines the expressivity of transformers with the local priors of CNNs to extract global and local forgery clues simultaneously. Additionally, novel MoE modules are designed to scale the model's capacity and dynamically select the optimal forgery experts, further enhancing forgery detection performance. The proposed learning scheme can be adapted to various transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with significantly reduced parameter overhead. The code is released at: https://github.com/LoveSiameseCat/MoE-FFD.
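
To make the idea of a frozen backbone with routed LoRA experts more concrete, the following is a minimal, dependency-free sketch and not the released MoE-FFD implementation. The class name `MoELoRALinear`, the expert ranks `(2, 4, 8)`, the top-1 routing rule, and the random stand-in for the pre-trained weight are all assumptions made for illustration; the Adapter (CNN) branch described in the abstract is omitted.

```python
import math
import random


def rand_matrix(rows, cols, rng, scale=0.02):
    """Return a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MoELoRALinear:
    """Frozen linear map plus a top-1 mixture of LoRA experts (hypothetical sketch)."""

    def __init__(self, d_in, d_out, ranks=(2, 4, 8), seed=0):
        rng = random.Random(seed)
        # Stand-in for a pre-trained projection weight; it is never updated.
        self.weight = rand_matrix(d_out, d_in, rng)
        # One LoRA expert per rank: down-projection (r x d_in) is random,
        # up-projection (d_out x r) starts at zero, the usual LoRA initialization.
        self.experts = []
        for r in ranks:
            down = rand_matrix(r, d_in, rng)
            up = [[0.0] * r for _ in range(d_out)]
            self.experts.append((down, up))
        # Trainable router that scores each expert for a given input token.
        self.gate = rand_matrix(len(ranks), d_in, rng)

    def forward(self, x):
        """Return (output vector, index of the selected expert)."""
        probs = softmax(matvec(self.gate, x))
        k = max(range(len(probs)), key=lambda i: probs[i])  # assumed top-1 routing
        down, up = self.experts[k]
        delta = matvec(up, matvec(down, x))  # low-rank update: up @ (down @ x)
        base = matvec(self.weight, x)        # frozen path
        return [b + probs[k] * d for b, d in zip(base, delta)], k

    def trainable_params(self):
        """Count LoRA and router parameters (everything except the frozen weight)."""
        lora = sum(len(down) * len(down[0]) + len(up) * len(up[0])
                   for down, up in self.experts)
        return lora + len(self.gate) * len(self.gate[0])

    def frozen_params(self):
        """Count parameters of the frozen weight matrix."""
        return len(self.weight) * len(self.weight[0])


# Usage: 768 is the hidden size of ViT-Base; the input token is a dummy vector.
layer = MoELoRALinear(d_in=768, d_out=768)
token = [0.1] * 768
output, expert_index = layer.forward(token)
print(expert_index, layer.trainable_params(), layer.frozen_params())
```

Because every up-projection starts at zero, the layer initially reproduces the frozen output exactly, so training begins from the pre-trained behavior. With these assumed ranks, the trainable parameters of this single layer amount to roughly 4% of the frozen weight; that ratio describes only this toy configuration and says nothing about the figures reported for MoE-FFD.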