ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current multimodal large language models (MLLMs) in image forgery detection and localization, which rely heavily on text-centric reasoning and struggle to model invisible low-level manipulation traces, often leading to hallucinations. To overcome this, the authors propose a vision-centric reasoning framework that leverages a forensic toolbox to convert implicit tampering cues into explicit visual intermediate representations. A strategic tool-learning paradigm is introduced, enabling the model to actively select multi-perspective analysis pathways, such as noise residuals, frequency-domain features, and compression history. By integrating gain-driven trajectory construction, supervised fine-tuning, and reinforcement learning optimization, the approach transcends conventional chain-of-thought reasoning, achieving precise pixel-level inconsistency modeling. The method attains state-of-the-art performance in both detection and localization tasks, demonstrating strong generalization, robustness, and minimal tool redundancy.
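To give a concrete sense of the "explicit visual intermediates" the summary describes, the sketch below shows one classic forensic primitive: a high-pass noise residual, which suppresses image content and exposes local noise patterns that can differ between spliced and authentic regions. This is a minimal illustration of the general idea, not the paper's actual toolbox; the 3x3 predictor kernel is an assumed, simplified choice.

```python
import numpy as np

def noise_residual(image: np.ndarray) -> np.ndarray:
    """High-pass noise residual of a grayscale image.

    Each pixel is predicted as the average of its 8 neighbors; the
    prediction error suppresses scene content and highlights local
    noise statistics. (Illustrative kernel, not the paper's tool.)
    """
    # 3x3 averaging kernel that excludes the center pixel
    k = np.array([[1, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]], dtype=float) / 8.0
    padded = np.pad(image.astype(float), 1, mode="reflect")
    h, w = image.shape
    pred = np.zeros((h, w), dtype=float)
    # Accumulate the weighted neighborhood via shifted views
    for dy in range(3):
        for dx in range(3):
            pred += k[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return image.astype(float) - pred
```

A region pasted in from a camera with a different sensor-noise level would show a visibly different residual variance, which is the kind of cue a vision-centric reasoner can inspect directly instead of describing in text.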

📝 Abstract
Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.
Problem

Research questions and friction points this paper is trying to address.

image forgery detection
multimodal large language models
visual-centric reasoning
tampering localization
low-level inconsistencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-Centric Reasoning
Strategic Tool Learning
Forensic Toolbox
Multimodal Large Language Models
Image Forgery Localization