🤖 AI Summary
Existing face forgery detection methods typically employ task-specific, isolated models, leading to computational redundancy and neglecting intrinsic correlations among four key tasks: image classification, video classification, spatial localization, and temporal localization. To address this, we propose OmniFD, the first unified multi-task framework capable of jointly handling all four tasks. Built upon the Swin Transformer, OmniFD learns a shared 4D spatiotemporal representation and incorporates learnable queries with cross-task attention to enable dynamic dependency modeling and fine-grained knowledge transfer. Its lightweight, task-agnostic head supports unified image/video input and parallel inference. Evaluated on multiple benchmarks, OmniFD achieves a 4.63% improvement in video classification accuracy while reducing model parameters by 63% and training time by 50%, significantly enhancing efficiency, generalization, and scalability.
📄 Abstract
Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model: image classification, video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both image and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform the refined representations into predictions for all FFD tasks. Extensive experiments demonstrate OmniFD's advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, in particular enabling fine-grained knowledge transfer between them; for example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos, and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing model parameters by 63% and training time by 50%. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is available at https://github.com/haotianll/OmniFD.
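To make the three-component pipeline concrete, below is a minimal NumPy sketch of the query-based cross-task interaction described above: per-task learnable queries attend over shared spatiotemporal tokens (here a random stand-in for the Swin encoder output), and lightweight linear heads map each refined task representation to its output space. All sizes, head names, and the single-layer attention are illustrative assumptions, not the actual OmniFD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_attention(queries, tokens, d):
    # queries: (T, d), one learnable query per FFD task.
    # tokens:  (N, d), flattened 4D spatiotemporal features from the shared encoder.
    scores = queries @ tokens.T / np.sqrt(d)   # (T, N) scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each task attends over all tokens
    return attn @ tokens                       # (T, d) refined per-task representations

d, n_tokens = 64, 49                           # hypothetical feature dim and token count
task_queries = rng.normal(size=(4, d))         # "learnable" queries (random init here)
tokens = rng.normal(size=(n_tokens, d))        # stand-in for Swin encoder output

refined = cross_task_attention(task_queries, tokens, d)

# Lightweight heads: one linear projection per task output space
# (binary logits for classification; per-patch / per-frame scores for localization).
heads = {"img_cls": rng.normal(size=(d, 2)),
         "vid_cls": rng.normal(size=(d, 2)),
         "spatial": rng.normal(size=(d, 49)),
         "temporal": rng.normal(size=(d, 16))}
preds = {name: refined[i] @ W for i, (name, W) in enumerate(heads.items())}
print({k: v.shape for k, v in preds.items()})
```

Because all four tasks read from the same refined representations, a single forward pass yields every prediction in parallel, which is the source of the parameter and training-time savings the abstract reports.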