🤖 AI Summary
To address the digital authenticity crisis precipitated by the proliferation of deepfake images, this paper proposes a hierarchical deep fusion framework for multi-type facial forgery detection. Methodologically, it integrates four heterogeneous pre-trained architectures (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT) via multi-stage fine-tuning and hierarchical feature concatenation, enabling complementary representation learning and substantially enhancing model generalization. Transfer learning and ensemble optimization are conducted on the MultiFFDI dataset. The resulting system achieves a score of 0.96852 on the competition's private leaderboard, ranking 20th among 184 teams. To the best of our knowledge, this is the first work to systematically unify the architectural strengths of MLPs, CNNs, and Vision Transformers (ViTs) within a single detection framework, establishing a reproducible and efficient paradigm for cross-architecture feature collaboration.
📝 Abstract
The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and well-generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT), each fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer on the fused features, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition's private leaderboard, securing 20th place out of 184 teams and demonstrating the efficacy of hierarchical fusion for complex image classification tasks.
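The fusion step described in the abstract (extract features from each fine-tuned backbone, concatenate them, and train a final classifier on the fused vector) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-backbone feature dimensions and the random stand-in extractors are assumptions, since the paper excerpt does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical penultimate-layer feature sizes for the four backbones;
# illustrative values only, not taken from the paper.
FEAT_DIMS = {"swin_mlp": 768, "coatnet": 1024, "efficientnetv2": 1280, "davit": 768}

def extract_features(image, backbone):
    """Stand-in for a frozen, fine-tuned backbone: in the real framework
    this would be the backbone's penultimate-layer embedding of `image`."""
    return rng.standard_normal(FEAT_DIMS[backbone])

def fused_forgery_prob(image, weights, bias):
    """Concatenate all backbone features and apply a linear head + sigmoid,
    mirroring the 'final classifier layer' trained on the fused features."""
    fused = np.concatenate([extract_features(image, b) for b in FEAT_DIMS])
    logit = fused @ weights + bias
    return 1.0 / (1.0 + np.exp(-logit))  # probability the face is forged

total_dim = sum(FEAT_DIMS.values())        # dimensionality of the fused vector
w = rng.standard_normal(total_dim) * 0.01  # untrained head, for illustration
p_fake = fused_forgery_prob(None, w, 0.0)
```

In practice each backbone would be fine-tuned separately first, then frozen while only the fusion head is trained, which keeps the final stage cheap relative to end-to-end training of all four networks.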