DGM4+: Dataset Extension for Global Scene Inconsistency

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
The existing DGM4 dataset covers only local manipulations (e.g., face swapping), which limits evaluation of models’ ability to detect global inconsistencies such as foreground-background mismatches. To address this, we introduce DGM4+, a benchmark extension comprising 5,000 high-quality multimodal samples. DGM4+ is the first dataset to systematically incorporate globally inconsistent scene manipulations, formally defining three new tampering categories: FG-BG (foreground-background mismatch), FG-BG+TA (FG-BG combined with text attribute manipulation), and FG-BG+TS (FG-BG combined with text split manipulation). Samples are generated with OpenAI's gpt-image-1 to produce semantically incongruent image-text pairs, followed by perceptual hashing for deduplication, OCR-based text cleaning, and strict control of visible face counts to ensure quality. As the first publicly available benchmark covering global scene manipulations, DGM4+ supports rigorous evaluation of holistic consistency reasoning in multimodal forensic models such as HAMMER, which currently struggle with foreground-background mismatches.
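As a rough illustration of the deduplication step mentioned above, the sketch below filters near-duplicate images with perceptual hashing using the imagehash and Pillow libraries; the Hamming-distance threshold, file format, and directory layout are assumptions for illustration, not details taken from the DGM4+ pipeline.

```python
# Minimal sketch of perceptual-hash deduplication (assumed threshold and layout).
from pathlib import Path
from PIL import Image
import imagehash

HASH_DISTANCE_THRESHOLD = 5  # assumed Hamming-distance cutoff for "near-duplicate"

def deduplicate(image_dir: str) -> list[Path]:
    """Keep the first occurrence of each perceptually distinct image."""
    kept_hashes: list[imagehash.ImageHash] = []
    kept_paths: list[Path] = []
    for path in sorted(Path(image_dir).glob("*.png")):
        h = imagehash.phash(Image.open(path))
        # Skip images whose hash is close to an already accepted sample.
        if any(h - prev <= HASH_DISTANCE_THRESHOLD for prev in kept_hashes):
            continue
        kept_hashes.append(h)
        kept_paths.append(path)
    return kept_paths
```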

📝 Abstract
The rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark DGM4+ that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
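For readers unfamiliar with gpt-image-1, the snippet below sketches how a single FG-BG mismatch image could be requested through OpenAI's images API; the prompt wording (adapted from the teacher-on-Mars example in the abstract), the image size, and the output file name are illustrative assumptions, not taken from the released generation script.

```python
# Hedged sketch of generating one FG-BG mismatch image with OpenAI's gpt-image-1.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Photorealistic news-style photo: a teacher calmly addressing a class "
    "of students, but the scene is set on the surface of Mars."
)

result = client.images.generate(model="gpt-image-1", prompt=prompt, size="1024x1024")

# gpt-image-1 returns the image as base64; decode and write it to disk.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("fg_bg_sample_0001.png", "wb") as f:
    f.write(image_bytes)
```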
Problem

Research questions and friction points this paper is trying to address.

Addressing global scene inconsistencies in multimodal disinformation detection
Extending datasets to include foreground-background mismatches and hybrid manipulations
Creating benchmarks for testing detectors on local and global reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended the DGM4 dataset with 5,000 foreground-background mismatch samples and their hybrids with text manipulations
Generated human-centric news-style images with OpenAI's gpt-image-1 and carefully designed prompts
Implemented quality control with face-count limits, perceptual-hash deduplication, and OCR-based text scrubbing (a minimal sketch of the face-count check follows this list)
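The sketch below illustrates the one-to-three-visible-faces check using an off-the-shelf OpenCV Haar cascade; the actual detector, parameters, and acceptance criteria used for DGM4+ are not specified in this summary and should be treated as assumptions.

```python
# Illustrative face-count quality gate (assumed detector and parameters).
import cv2

_FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def has_valid_face_count(image_path: str, min_faces: int = 1, max_faces: int = 3) -> bool:
    """Accept a sample only if the detected face count falls within [min_faces, max_faces]."""
    image = cv2.imread(image_path)
    if image is None:
        return False  # unreadable file fails the quality gate
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = _FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return min_faces <= len(faces) <= max_faces
```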