🤖 AI Summary
HAMMER detects local manipulations effectively in multimodal forgery detection but struggles with global scene inconsistencies, such as foreground-background semantic mismatches. To address this limitation, we propose a lightweight, training-free segmentation-guided scoring method. It uses person/face segmentation masks to separate foreground and background regions, extracts joint vision-language embeddings for each region, and computes region-aware consistency scores that compensate for HAMMER's label-space bias and local attention focus. The scores are fused with HAMMER's original prediction to increase sensitivity to contextual mismatches and to improve binary detection, grounding, and token-level explanations. On the DGM4 benchmark, the method improves detection of global inconsistencies while adding negligible inference overhead. It is fully compatible with the original HAMMER architecture and requires no retraining or fine-tuning.
📝 Abstract
We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
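The scoring and fusion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`cosine`, `coherence_scores`, `fuse`), the use of cosine similarity as the coherence measure, the linear fusion rule, and the `weight` parameter are all hypothetical. The embeddings are assumed to have been extracted already by a joint vision-language model from the masked foreground, the masked background, and the caption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence_scores(fg_emb, bg_emb, text_emb):
    """Region-aware coherence scores (assumed here to be pairwise cosine similarities)."""
    return {
        "fg_bg": cosine(fg_emb, bg_emb),      # does the subject fit the scene?
        "text_fg": cosine(text_emb, fg_emb),  # does the caption match the subject?
        "text_bg": cosine(text_emb, bg_emb),  # does the caption match the scene?
    }


def fuse(hammer_fake_prob, scores, weight=0.5):
    """Blend HAMMER's fake probability with an FG-BG incoherence term.

    Hypothetical fusion rule: the FG-BG cosine in [-1, 1] is mapped to an
    incoherence value in [0, 1], then linearly mixed with HAMMER's output.
    """
    incoherence = (1.0 - scores["fg_bg"]) / 2.0
    return (1.0 - weight) * hammer_fake_prob + weight * incoherence


# Toy embeddings: an unrelated foreground and background (orthogonal vectors).
scores = coherence_scores(fg_emb=[1.0, 0.0], bg_emb=[0.0, 1.0], text_emb=[1.0, 0.0])
fused = fuse(hammer_fake_prob=0.2, scores=scores, weight=0.5)
```

In this toy case HAMMER alone assigns a low fake probability, and the low foreground-background similarity raises the fused score. Because the sketch only post-processes HAMMER's output, it is consistent with the inference-only design described in the abstract; the actual score definitions and fusion weights are in the released scripts.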