🤖 AI Summary
The existing DGM4 dataset covers only local manipulations (e.g., face swapping), limiting evaluation of models' ability to detect global inconsistencies (e.g., foreground-background mismatches). To address this, we introduce DGM4+, a new benchmark comprising 5,000 high-quality multimodal samples. DGM4+ is the first dataset to systematically incorporate globally inconsistent scene manipulations, defining three new tampering categories: FG-BG (foreground-background mismatch), FG-BG+TA (FG-BG combined with text attribute manipulation), and FG-BG+TS (FG-BG combined with text split manipulation). Samples are generated with OpenAI's gpt-image-1 to produce semantically incongruent image-text pairs, followed by perceptual-hash deduplication, OCR-based text scrubbing, and a strict one-to-three visible-face constraint to ensure quality. Evaluation shows that current multimodal forensic models, including HAMMER, struggle with foreground-background mismatches, underscoring the need for such a benchmark. As the first publicly available benchmark targeting global consistency reasoning, DGM4+ enables rigorous evaluation of holistic consistency awareness in deepfake detection.
📝 Abstract
Rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries, remain uncovered. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality-control pipelines enforce one to three visible faces, perceptual-hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark, DGM4+, that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
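To make the perceptual-hash deduplication step concrete, here is a minimal pure-Python sketch using a simple average hash (aHash) over pre-loaded grayscale pixel grids. The specific hash function, grid size, and Hamming-distance threshold are assumptions for illustration; the released pipeline may use a different perceptual hash (e.g., pHash via the `imagehash` library):

```python
def ahash(gray, hash_size=8):
    """Average-hash a grayscale image given as a 2D list of 0-255 values.

    Downscales by block averaging to hash_size x hash_size, then thresholds
    each cell against the mean to produce a binary fingerprint.
    (Illustrative stand-in for the paper's perceptual hash.)
    """
    h, w = len(gray), len(gray[0])
    bh, bw = h // hash_size, w // hash_size
    cells = []
    for i in range(hash_size):
        for j in range(hash_size):
            block = [gray[y][x]
                     for y in range(i * bh, (i + 1) * bh)
                     for x in range(j * bw, (j + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return tuple(c > mean for c in cells)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def dedup(images, threshold=5):
    """Keep an image only if its hash differs from all kept hashes
    by more than `threshold` bits (threshold is an assumed value)."""
    kept, hashes = [], []
    for name, gray in images:
        h = ahash(gray)
        if all(hamming(h, prev) > threshold for prev in hashes):
            kept.append(name)
            hashes.append(h)
    return kept
```

A near-duplicate render (same layout, slightly different pixel values) hashes to the same fingerprint and is dropped, while a structurally different image survives, which is the behavior the deduplication stage needs.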