🤖 AI Summary
This work addresses the limitations of current multimodal large language models (MLLMs) in image forgery detection and localization, which rely heavily on text-centric reasoning and struggle to model invisible low-level manipulation traces, often leading to hallucinations. To overcome this, the authors propose a vision-centric reasoning framework that leverages a forensic toolbox to convert implicit tampering cues into explicit visual intermediate representations. A strategic tool-learning paradigm is introduced, enabling the model to actively select multi-perspective analysis pathways, such as noise residuals, frequency-domain features, and compression history. By combining gain-driven trajectory construction, supervised fine-tuning, and reinforcement-learning optimization, the approach moves beyond conventional text-only chain-of-thought reasoning and models pixel-level inconsistencies precisely. The method attains state-of-the-art performance in both detection and localization, with strong generalization, robustness, and minimal tool redundancy.
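The "visual intermediates" mentioned above (noise residuals, frequency-domain maps) can be illustrated with a minimal sketch. This is not the paper's actual forensic toolbox; the function names, the synthetic image, and the 3x3 box filter are assumptions for illustration only. The idea: a spliced region often carries a different noise fingerprint than its surroundings, and a simple high-pass residual makes that difference visible.

```python
import numpy as np

def noise_residual(img: np.ndarray) -> np.ndarray:
    """High-pass residual: subtract a 3x3 local mean to expose noise inconsistencies."""
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    blurred = sum(
        padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        for dy in range(3) for dx in range(3)
    ) / 9.0
    return img - blurred

def frequency_map(img: np.ndarray) -> np.ndarray:
    """Log-magnitude FFT spectrum; periodic resampling/compression artifacts appear as peaks."""
    spectrum = np.fft.fftshift(np.fft.fft2(img.astype(np.float64)))
    return np.log1p(np.abs(spectrum))

# Hypothetical grayscale image with a "pasted" patch whose noise level differs
rng = np.random.default_rng(0)
img = rng.normal(128.0, 2.0, size=(64, 64))
img[16:32, 16:32] += rng.normal(0.0, 8.0, size=(16, 16))  # noisier spliced region

residual = noise_residual(img)
# Residual energy is higher inside the tampered block than in a clean corner
print(residual[16:32, 16:32].var() > residual[48:, 48:].var())
```

Real forensic tools (e.g., SRM filters, ELA, DCT-histogram analysis of compression history) are far more sophisticated, but they produce the same kind of explicit visual map that the framework lets the MLLM request and inspect instead of reasoning about invisible traces in text alone.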
📝 Abstract
Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths, including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and the frequency domain. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.