🤖 AI Summary
Existing face forgery detection methods typically employ task-specific, isolated models, leading to computational redundancy and neglecting intrinsic correlations among four key tasks: image classification, video classification, spatial localization, and temporal localization. To address this, we propose OmniFD, the first unified multi-task framework capable of jointly handling all four tasks. Built upon the Swin Transformer, OmniFD learns a shared 4D spatiotemporal representation and incorporates learnable queries with cross-task attention to enable dynamic dependency modeling and fine-grained knowledge transfer. Its lightweight, task-agnostic head supports unified image/video input and parallel inference. Evaluated on multiple benchmarks, OmniFD achieves a 4.63% improvement in video classification accuracy while reducing model parameters by 63% and training time by 50%, significantly enhancing efficiency, generalization, and scalability.
📄 Abstract
Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model: image classification, video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both image and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform the refined representations into predictions for all FFD tasks. Extensive experiments demonstrate OmniFD's advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, in particular enabling fine-grained knowledge transfer between them; for example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos, and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing model parameters by 63% and training time by 50%. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is available at https://github.com/haotianll/OmniFD.
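To make the three-component pipeline concrete, below is a minimal NumPy sketch of the query-based cross-task interaction described above: per-task learnable queries attend over shared spatiotemporal tokens (here a random stand-in for the Swin encoder output), and lightweight linear heads map each refined task representation to its output space. All sizes, head names, and the single-layer attention are illustrative assumptions, not the actual OmniFD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_attention(queries, tokens, d):
    # queries: (T, d), one learnable query per FFD task.
    # tokens:  (N, d), flattened 4D spatiotemporal features from the shared encoder.
    scores = queries @ tokens.T / np.sqrt(d)   # (T, N) scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each task attends over all tokens
    return attn @ tokens                       # (T, d) refined per-task representations

d, n_tokens = 64, 49                           # hypothetical feature dim and token count
task_queries = rng.normal(size=(4, d))         # "learnable" queries (random init here)
tokens = rng.normal(size=(n_tokens, d))        # stand-in for Swin encoder output

refined = cross_task_attention(task_queries, tokens, d)

# Lightweight heads: one linear projection per task output space
# (binary logits for classification; per-patch / per-frame scores for localization).
heads = {"img_cls": rng.normal(size=(d, 2)),
         "vid_cls": rng.normal(size=(d, 2)),
         "spatial": rng.normal(size=(d, 49)),
         "temporal": rng.normal(size=(d, 16))}
preds = {name: refined[i] @ W for i, (name, W) in enumerate(heads.items())}
print({k: v.shape for k, v in preds.items()})
```

Because all four tasks read from the same refined representations, a single forward pass yields every prediction in parallel, which is the source of the parameter and training-time savings the abstract reports.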