🤖 AI Summary
Problem: Existing image forgery detection methods rely on either low-level artifacts or high-level semantics in isolation, and therefore fail to model the interactions between the two levels.
Method: We propose ForenAgent, the first “code-in-the-loop” digital forensics paradigm: a multimodal large language model autonomously invokes, executes, and iteratively refines Python-based image analysis tools to enable joint semantic and pixel-level reasoning. Its core is a dynamic inference loop (global perception → local focus → iterative probing → holistic adjudication). The loop is supported by three components: cold-start initialization followed by reinforcement fine-tuning, process-reward-driven inference alignment, and on-the-fly generation and execution of Python tools.
Contribution/Results: We introduce FABench, the first large-scale heterogeneous benchmark for agent-based forensic evaluation, comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments demonstrate substantial improvements in detection robustness and interpretability, alongside emergent capabilities in tool utilization and reflective reasoning.
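The four-stage loop described in the summary can be illustrated with a minimal sketch. This is not the ForenAgent implementation: the model is replaced by a fixed tool string and a caller-supplied region proposer, the image is a plain list of pixel rows, and every name here (`run_tool`, `VARIANCE_TOOL`, `forensic_loop`, the variance-spread decision rule, the threshold) is a hypothetical stand-in chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Image = List[List[int]]            # toy grayscale image: rows of 0-255 ints
Region = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive

# Hypothetical tool source a model might write: pixel variance inside a region.
VARIANCE_TOOL = '''
def tool(image, region):
    top, left, bottom, right = region
    values = [image[r][c] for r in range(top, bottom) for c in range(left, right)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
'''


def run_tool(code: str, image: Image, region: Region) -> float:
    """Execute generated code that must define `tool(image, region) -> float`."""
    namespace: Dict[str, Any] = {}
    exec(code, namespace)  # assumption: a real system would sandbox this call
    return float(namespace["tool"](image, region))


def forensic_loop(
    image: Image,
    propose_regions: Callable[[Image], List[Region]],
    threshold: float = 100.0,
    max_rounds: int = 4,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Run the four stages once and return (verdict, trace of stage results)."""
    height, width = len(image), len(image[0])
    trace: List[Tuple[str, float]] = []

    # 1. Global perception: one statistic over the whole image.
    trace.append(("global_perception",
                  run_tool(VARIANCE_TOOL, image, (0, 0, height, width))))

    # 2. Local focus: pick candidate regions (stand-in for the model's choice).
    regions = propose_regions(image)[:max_rounds]
    trace.append(("local_focus", float(len(regions))))

    # 3. Iterative probing: run the tool on each candidate region in turn.
    scores = []
    for region in regions:
        score = run_tool(VARIANCE_TOOL, image, region)
        scores.append(score)
        trace.append(("iterative_probing", score))

    # 4. Holistic adjudication: toy rule, flag a large spread in local variance.
    spread = max(scores) - min(scores) if scores else 0.0
    trace.append(("holistic_adjudication", spread))
    return ("forged" if spread > threshold else "authentic"), trace
```

In this sketch the verdict depends only on how much the local statistics disagree. An actual agent would also weigh semantic evidence and could rewrite the tool between probing rounds.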
📝 Abstract
Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline that combines Cold Start and Reinforcement Fine-Tuning to progressively enhance its tool-interaction capability and reasoning adaptability. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.
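The abstract states that the reasoning loop is also instantiated as a task-aligned process reward but does not give its form. As a purely illustrative sketch, one simple way to reward adherence to the loop is to credit a trajectory for each stage it reaches in the expected order. The stage labels, the `process_reward` function, and the equal per-stage weighting are all assumptions made here, not details taken from the paper.

```python
from typing import List

# Assumed stage labels, in the order given in the abstract.
STAGES = [
    "global_perception",
    "local_focus",
    "iterative_probing",
    "holistic_adjudication",
]


def process_reward(trajectory: List[str]) -> float:
    """Return the fraction of stages reached in order, between 0.0 and 1.0.

    Repeating a stage (for example several probing rounds) is neither
    rewarded nor penalised; a stage that appears before its predecessor
    has been reached earns no credit.
    """
    next_stage = 0
    for step in trajectory:
        if next_stage < len(STAGES) and step == STAGES[next_stage]:
            next_stage += 1
    return next_stage / len(STAGES)
```

Under these assumptions, a trajectory that passes through all four stages in order scores 1.0 regardless of how many probing rounds it uses, while one that jumps from global perception straight to adjudication scores 0.25.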