Propose and Rectify: A Forensics-Driven MLLM Framework for Image Manipulation Localization

📅 2025-08-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current multimodal large language models (MLLMs) excel at semantic reasoning but struggle to perceive low-level forensic artifacts, which leads to inaccurate tampering localization. To address this, the paper proposes a two-stage, forensics-driven framework. In the first stage, a forensic-adapted LLaVA generates semantic-level tampering proposals. In the second stage, a Forensics Rectification Module validates and refines those proposals using low-level forgery features extracted by multi-scale filtering, and an Enhanced Segmentation Module injects these forensic cues into SAM’s image embeddings to sharpen the segmentation. The work explicitly aligns and fuses semantic understanding with forensic features. The authors report state-of-the-art detection and localization accuracy across multiple benchmarks, with improved robustness against complex manipulations such as splicing and copy-move and better cross-domain generalization.

📝 Abstract

The increasing sophistication of image manipulation techniques demands robust forensic solutions that can both reliably detect alterations and precisely localize tampered regions. Recent Multimodal Large Language Models (MLLMs) show promise by leveraging world knowledge and semantic understanding for context-aware detection, yet they struggle with perceiving subtle, low-level forensic artifacts crucial for accurate manipulation localization. This paper presents a novel Propose-Rectify framework that effectively bridges semantic reasoning with forensic-specific analysis. In the proposal stage, our approach utilizes a forensic-adapted LLaVA model to generate initial manipulation analysis and preliminary localization of suspicious regions based on semantic understanding and contextual reasoning. In the rectification stage, we introduce a Forensics Rectification Module that systematically validates and refines these initial proposals through multi-scale forensic feature analysis, integrating technical evidence from several specialized filters. Additionally, we present an Enhanced Segmentation Module that incorporates critical forensic cues into SAM's encoded image embeddings, thereby overcoming inherent semantic biases to achieve precise delineation of manipulated regions. By synergistically combining advanced multimodal reasoning with established forensic methodologies, our framework ensures that initial semantic proposals are systematically validated and enhanced through concrete technical evidence, resulting in comprehensive detection accuracy and localization precision. Extensive experimental validation demonstrates state-of-the-art performance across diverse datasets with exceptional robustness and generalization capabilities.
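To make "multi-scale forensic feature analysis" more concrete, here is a minimal sketch of one common way to expose low-level residuals at several scales. It is not the paper's Forensics Rectification Module: the abstract does not name its filters, so this example assumes a simple high-pass residual (image minus a box blur) as a stand-in, and the names `box_blur`, `multiscale_residuals`, and the window sizes are hypothetical.

```python
import numpy as np


def box_blur(img: np.ndarray, k: int) -> np.ndarray:
    """Mean filter with an odd k x k window and edge padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    # Integral image: each window sum becomes four lookups.
    ii = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))  # prepend a zero row and column
    h, w = img.shape
    total = (
        ii[k:k + h, k:k + w]
        - ii[:h, k:k + w]
        - ii[k:k + h, :w]
        + ii[:h, :w]
    )
    return total / (k * k)


def multiscale_residuals(gray: np.ndarray, scales=(3, 5, 7)) -> np.ndarray:
    """Stack high-pass residuals (image minus blur), one channel per scale.

    Assumption: a stand-in for the paper's unspecified forensic filters.
    Returns an array of shape (len(scales), H, W).
    """
    gray = gray.astype(np.float64)
    return np.stack([gray - box_blur(gray, k) for k in scales], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.full((32, 32), 0.5)
    # Simulate a pasted region whose noise statistics differ from the background.
    image[8:16, 8:16] += rng.normal(0.0, 0.1, size=(8, 8))
    residuals = multiscale_residuals(image)
    print(residuals.shape)
```

Smooth regions produce residuals near zero, while a region with different noise statistics stands out across scales. That kind of evidence is what a rectification step could use to confirm or reject a semantically proposed region.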

Problem

Research questions and friction points this paper is trying to address.

Detecting and localizing manipulated regions in images

Bridging semantic reasoning with forensic-specific analysis

Overcoming semantic biases for precise manipulation delineation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Forensics-adapted LLaVA model for initial analysis

Multi-scale forensic feature rectification module

Enhanced segmentation with forensic cues integration
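The third contribution, integrating forensic cues into SAM's encoded image embeddings, can be illustrated with a small sketch. The paper's actual fusion design is not described on this page, so the example assumes the simplest plausible form: forensic feature maps are average-pooled to the embedding resolution, projected channel-wise, and added to the embedding. The functions `avg_pool` and `fuse_forensic_cues`, the `gate` parameter, and all shapes are hypothetical.

```python
import numpy as np


def avg_pool(feat: np.ndarray, factor: int) -> np.ndarray:
    """Downsample (F, H, W) maps by averaging factor x factor blocks.

    H and W must be divisible by factor.
    """
    f, h, w = feat.shape
    blocks = feat.reshape(f, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


def fuse_forensic_cues(
    image_emb: np.ndarray,      # (C, H, W) embedding from a segmentation encoder
    forensic_feat: np.ndarray,  # (F, H, W) forensic maps at embedding resolution
    proj: np.ndarray,           # (C, F) per-pixel linear projection (1x1 conv)
    gate: float = 1.0,
) -> np.ndarray:
    """Additive fusion (an assumed design): emb + gate * (proj applied per pixel)."""
    projected = np.einsum("cf,fhw->chw", proj, forensic_feat)
    return image_emb + gate * projected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 4, 4))            # toy embedding
    forensic = rng.normal(size=(3, 16, 16))     # toy forensic maps at image scale
    pooled = avg_pool(forensic, factor=4)       # match the 4 x 4 embedding grid
    proj = np.zeros((8, 3))                     # zero projection leaves emb unchanged
    fused = fuse_forensic_cues(emb, pooled, proj)
    print(fused.shape, np.array_equal(fused, emb))
```

With a zero projection the fused embedding equals the original, so in a trainable version of this sketch the forensic branch could start as a no-op and learn how much low-level evidence to add.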

🔎 Similar Papers

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models