SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

HAMMER detects local manipulations effectively in multimodal forgery detection but struggles with global scene inconsistencies, such as foreground-background semantic mismatches. To address this limitation, we propose a lightweight, training-free segmentation-guided scoring method. It uses person/face segmentation masks to decouple foreground and background regions, extracts joint vision-language embeddings for cross-modal alignment, and introduces a region-aware consistency score to mitigate label-space bias and the model's local attention focus. The approach increases sensitivity to contextual mismatches and improves tampering localization and model interpretability. On the DGM4 benchmark, the method improves global inconsistency detection while adding negligible inference overhead. It is fully compatible with the original HAMMER architecture and requires no retraining or fine-tuning.

📝 Abstract

We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
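The abstract does not give the scoring or fusion formulas, so the following is only a minimal sketch of how region-aware coherence scoring and late fusion could look. It assumes the foreground, background, and caption embeddings are already available as plain vectors from some joint vision-language model. The names `coherence_score` and `fuse`, the choice of cosine similarity, and the `weight` parameter are all hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence_score(fg_emb, bg_emb, text_emb):
    """Hypothetical region-aware coherence in [0, 1].

    Assumption: coherence is the mean of the foreground-background and
    background-text cosine similarities, rescaled from [-1, 1] to [0, 1].
    """
    sims = [cosine(fg_emb, bg_emb), cosine(bg_emb, text_emb)]
    mean_sim = sum(sims) / len(sims)
    return 0.5 * (mean_sim + 1.0)


def fuse(hammer_fake_prob, coherence, weight=0.5):
    """Hypothetical late fusion of HAMMER's fake probability with incoherence.

    Assumption: a convex combination, where low coherence raises the
    final manipulation score and `weight` controls its influence.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return (1.0 - weight) * hammer_fake_prob + weight * (1.0 - coherence)
```

In this sketch, a sample that HAMMER rates as likely real (for example a fake probability of 0.2) still receives a higher fused score when its foreground and background embeddings disagree, which is the behaviour the abstract attributes to SGS.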

Problem

Research questions and friction points this paper is trying to address.

Detects global scene inconsistencies in manipulated multimodal content

Addresses foreground-background mismatch using segmentation-guided scoring

Improves multimodal manipulation detection without model retraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmentation masks separate foreground and background regions

Extracts embeddings using a joint vision-language model

Computes region-aware coherence scores and fuses them with HAMMER's original prediction
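The first step in the list above, separating foreground from background with a segmentation mask, can be illustrated with a small sketch. It assumes a binary person/face mask is already available from some segmentation model and represents a grayscale image as a list of rows. The helper name `split_by_mask` and the choice to zero out masked pixels are illustrative assumptions, not the authors' implementation.

```python
def split_by_mask(image, mask):
    """Split an image into foreground and background using a binary mask.

    `image` and `mask` are lists of rows with identical shape. A truthy
    mask value marks a foreground (person/face) pixel. Pixels outside the
    selected region are set to 0 (an assumption made for this sketch).
    """
    if len(image) != len(mask) or any(
        len(row) != len(mrow) for row, mrow in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")

    foreground = [
        [px if m else 0 for px, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]
    background = [
        [0 if m else px for px, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]
    return foreground, background
```

Each of the two returned regions would then be passed to the vision-language encoder separately, so that foreground and background receive their own embeddings.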
