Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image manipulation localization (IML) faces a fundamental trade-off: supervised methods rely on costly pixel-level annotations, while weakly or unsupervised approaches suffer from limited accuracy and poor interpretability. To address this, we propose the first training-free contextual forensic chain framework, which integrates multimodal large language models (MLLMs) into zero-shot image forensics. Our method constructs an object-centric rule base, designs a multi-stage progressive reasoning chain that emulates expert cognitive processes, and incorporates an adaptive filtering mechanism to enable end-to-end, interpretable analysis, from image-level classification down to pixel-level localization. Crucially, it requires no fine-tuning and demonstrates strong cross-dataset generalization. On multiple benchmarks, it significantly outperforms existing zero-shot methods and matches or surpasses several weakly supervised and even fully supervised baselines. The framework unifies its outputs into three complementary modalities: pixel-level localization maps, image-level binary decisions, and text-based attribution explanations.

📝 Abstract
Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates objectified rule construction with adaptive filtering to build a reliable knowledge base, and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensic results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
Problem

Research questions and friction points this paper is trying to address.

Detects and localizes image manipulations without training requirements
Addresses limitations of supervised methods needing pixel-level annotations
Provides interpretable forensic analysis through multimodal reasoning chains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework using multi-modal language models
Objectified rule construction with adaptive filtering
Multi-step reasoning pipeline mirroring expert workflows
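The pipeline described above (objectified rule construction, adaptive filtering, then multi-step reasoning that turns coarse proposals into an image-level verdict, region localization, and a textual explanation) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the rule templates, the `mock_mllm` stand-in, the reliability scoring, and all names are assumptions introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class ForensicResult:
    is_manipulated: bool   # image-level binary decision
    localization: list     # coarse region proposals kept as the pixel-level surrogate
    explanation: str       # text-based attribution

def build_rule_base(objects):
    """Objectified rule construction (hypothetical templates): one check per object."""
    rules = []
    for obj in objects:
        rules.append(f"check lighting/shadow consistency of '{obj}'")
        rules.append(f"check boundary artifacts around '{obj}'")
    return rules

def adaptive_filter(rules, reliability):
    """Adaptive filtering: keep rules whose (mock) reliability clears a threshold."""
    return [r for r in rules if reliability(r) >= 0.5]

def mock_mllm(prompt):
    """Stand-in for an MLLM call; a real system would query a vision-language model."""
    return "suspicious" if "spliced_person" in prompt else "consistent"

def icfc(objects, region_proposals):
    """Multi-step progressive reasoning: coarse proposals -> fine-grained verdict."""
    rules = adaptive_filter(build_rule_base(objects), lambda r: 1.0)
    hits = [r for r in rules if mock_mllm(r) == "suspicious"]
    manipulated = bool(hits)
    regions = region_proposals if manipulated else []
    explanation = "; ".join(hits) if hits else "no inconsistency found"
    return ForensicResult(manipulated, regions, explanation)
```

A call such as `icfc(["spliced_person", "background_tree"], [(10, 20, 50, 60)])` would flag the image and keep the proposed region, while an image with only consistent objects returns a clean verdict with an empty localization list.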