🤖 AI Summary
High-resolution image forgery localization faces three challenges: low sensitivity to diffusion-generated forgeries, high computational cost, and difficulty in jointly capturing global and local cues. To address them, this paper proposes an efficient, lightweight three-stage architecture. Its key contributions are: (1) constructing the first large-scale, high-resolution Synthetic Image Forgery (SIF) dataset, with 1,200+ samples covering diffusion-model-generated forgeries; (2) designing EfficientRWKV, a parallel hybrid network that integrates state-space modeling with lightweight attention to capture global and local features together; and (3) introducing a multi-scale supervision strategy to enforce consistency across hierarchical predictions. Evaluated on both the SIF dataset and standard benchmarks, the method achieves state-of-the-art performance, outperforming lightweight ViT-based models in localization accuracy, computational efficiency (42% fewer FLOPs), and inference speed (32 FPS at 4K resolution), which makes it well suited to real-time digital forensics.
📝 Abstract
As imaging devices deliver ever-higher resolutions and diffusion-based forgery methods emerge, current detectors trained only on traditional datasets (covering splicing, copy-move, and object-removal forgeries) lack exposure to this new manipulation type. To address this, we propose a novel high-resolution SIF dataset of 1200+ diffusion-generated manipulations with semantically extracted masks. Such high-resolution data, however, also poses a challenge to existing methods, whose prohibitive computational complexity runs up against resource constraints. We therefore propose EfficientIML, a novel model with a lightweight, three-stage EfficientRWKV backbone. EfficientRWKV's hybrid state-space and attention network captures global context and local details in parallel, while a multi-scale supervision strategy enforces consistency across hierarchical predictions. Extensive evaluations on our dataset and standard benchmarks demonstrate that our approach outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs, and inference speed, underscoring its suitability for real-time forensic applications.
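
To make the multi-scale supervision idea concrete, the following is a minimal sketch rather than the paper's actual loss. It assumes one predicted probability map per stage, each at half the resolution of the previous one, and supervises every map with the ground-truth mask downsampled to the same size. The function names (`avg_pool2x`, `bce`, `multi_scale_loss`), the choice of binary cross-entropy, the average-pooling downsampler, and the per-scale weights are all illustrative assumptions; the abstract does not specify them.

```python
import math


def avg_pool2x(grid):
    """Halve a 2D grid by averaging non-overlapping 2x2 blocks."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            (grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally sized 2D grids."""
    assert len(pred) == len(target) and len(pred[0]) == len(target[0])
    total, count = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def multi_scale_loss(preds, mask, weights=None):
    """Weighted sum of per-scale losses (hypothetical formulation).

    preds: probability maps ordered finest to coarsest; preds[0] matches
           the mask size and each later map has half the resolution.
    mask:  ground-truth forgery mask with values in [0, 1].
    """
    weights = weights or [1.0] * len(preds)
    loss, target = 0.0, mask
    for k, (pred, w) in enumerate(zip(preds, weights)):
        if k > 0:
            target = avg_pool2x(target)  # match the coarser prediction
        loss += w * bce(pred, target)
    return loss


# Toy 4x4 mask whose forged region is the top-left 2x2 block.
mask = [[1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
perfect = [mask, avg_pool2x(mask)]
uninformed = [[[0.5] * 4 for _ in range(4)], [[0.5] * 2 for _ in range(2)]]
print(multi_scale_loss(perfect, mask), multi_scale_loss(uninformed, mask))
```

Because every scale is compared against the same underlying mask, predictions that disagree across the hierarchy are penalized at one scale or another, which is one plausible reading of "enforces consistency across hierarchical predictions".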