FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing synthetic image detection methods based on large vision-language models, which rely heavily on large-scale forged data and lack causal reasoning capabilities, often leading to hallucinated explanations. The authors propose a framework that integrates physical commonsense knowledge with chain-of-thought reasoning enhanced by critical thinking. Trained with supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), the model generates a forgery hypothesis and, at the same time, constructs physically grounded counter-evidence for authenticity, enabling bidirectional dialectical reasoning. Introducing physical commonsense into the reasoning chain establishes authenticity anchors, which is intended to reduce reliance on imitating large volumes of forged data. The method is reported to achieve state-of-the-art performance among the evaluated models across multiple benchmarks, improving detection accuracy, interpretability, and robustness while mitigating over-rejection of authentic images.
📝 Abstract
The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made some progress, they still rely on imitation learning from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, which aims to endow the model with human-like critical thinking capabilities when performing synthetic image detection. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we construct the FakeClue++ dataset of high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves state-of-the-art (SOTA) performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.
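To make the GRPO step mentioned in the abstract more concrete, the sketch below shows the group-relative advantage computation that gives GRPO its name: several responses are sampled for the same input, and each response's reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration; the abstract does not specify the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several responses sampled for ONE input.
    eps: small constant (assumed) that avoids division by zero when all
         rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four reasoning chains sampled for one image,
# e.g. 1.0 when the final real/fake verdict is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# A group where every response earns the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that score above their group's average receive a positive advantage and are reinforced, while those below receive a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.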
Problem

Research questions and friction points this paper is trying to address.

synthetic image detection
causal reasoning
explanatory hallucination
physical commonsense
critical thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Physical Commonsense
Synthetic Image Detection
GRPO
Interpretable AI