Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation

📅 2025-06-06
🤖 AI Summary
To address insufficient modeling of fine-grained local consistency in detecting and grounding multi-modal media manipulation (DGM4), which leads to weak forgery perception and unreliable localization, this paper proposes a Contextual-Semantic Consistency Learning (CSCL) framework. The method introduces (1) image and text branches, each with two cascaded decoders that separately model intra-modal contextual consistency and cross-modal semantic consistency; and (2) fine-grained supervision derived from the heterogeneous information of each token pair, followed by forgery-aware reasoning or aggregation. Evaluated on the DGM4 benchmark, CSCL achieves new state-of-the-art (SOTA) performance, most notably in grounding manipulated content. The source code and pre-trained weights are publicly released.

📝 Abstract
To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, which usually results in inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM4. Two branches are established, one for the image modality and one for the text modality. Each branch contains two cascaded decoders, a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), which capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD follow the same procedure for capturing fine-grained forgery details. Specifically, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Forgery-aware reasoning or aggregation is then applied to mine forgery cues from the consistency features. Extensive experiments on the DGM4 dataset demonstrate that CSCL achieves new state-of-the-art performance, especially in grounding manipulated content. Code and weights are available at https://github.com/liyih/CSCL.
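The token-pair supervision mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes, hypothetically, that each token carries a binary label (0 = pristine, 1 = manipulated), that a pair counts as "consistent" when both labels match, and that predicted consistency is the cosine similarity of token features rescaled to [0, 1]. The function names `pair_consistency_targets`, `pair_consistency_scores`, and `consistency_loss` are illustrative only.

```python
import math

def pair_consistency_targets(token_labels):
    """Assumed target: 1.0 if two tokens share the same label, else 0.0."""
    n = len(token_labels)
    return [[1.0 if token_labels[i] == token_labels[j] else 0.0
             for j in range(n)] for i in range(n)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def pair_consistency_scores(features):
    """Predicted consistency: cosine similarity mapped from [-1, 1] to [0, 1]."""
    n = len(features)
    return [[0.5 * (cosine(features[i], features[j]) + 1.0)
             for j in range(n)] for i in range(n)]

def consistency_loss(features, token_labels, eps=1e-6):
    """Binary cross-entropy between predicted and target pair consistency,
    averaged over all token pairs (an assumed choice of loss)."""
    targets = pair_consistency_targets(token_labels)
    scores = pair_consistency_scores(features)
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(scores[i][j], eps), 1.0 - eps)  # clip for log stability
            t = targets[i][j]
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / (n * n)

# Toy example: tokens 0 and 1 have similar features, token 2 differs.
toy_features = [[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]]
loss_matching = consistency_loss(toy_features, [0, 0, 1])
loss_mismatched = consistency_loss(toy_features, [0, 1, 0])
```

In this toy setup the loss is lower when the labels agree with the feature geometry (the third token marked as manipulated) than when they do not, which is the behavior such pairwise supervision is meant to encourage.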
Problem

Research questions and friction points this paper is trying to address.

Detect and ground multi-modal media manipulation
Enhance fine-grained forgery perception ability
Improve consistency learning for reliable results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual-Semantic Consistency Learning (CSCL) approach
Two-branch cascaded decoders for modalities
Consistency features with heterogeneous supervision
Authors

Yiheng Li
MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Yang Yang
MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Zichang Tan
Previously CASIA, Baidu Inc.
Computer Vision, Biometrics, Autonomous Driving, Robotics, MLLM

Huan Liu
Beijing Jiaotong University

Weihua Chen
Alibaba DAMO Academy, previously NLPR, CASIA
Computer Vision

Xu Zhou
Sangfor Technologies Inc.

Zhen Lei
Associate Professor, OSCO Research Chair in Off-site Construction
Offsite Construction, Construction Engineering and Management