🤖 AI Summary
To address the insufficient fine-grained modeling of local consistency in detecting and grounding multi-modal media manipulation (DGM4), which leads to weak forgery perception and unreliable localization, this paper proposes a Contextual-Semantic Consistency Learning (CSCL) framework. The method introduces (1) a cascaded dual-decoder architecture that separately models within-modality contextual consistency and across-modality semantic consistency in both image and text branches; and (2) a fine-grained supervision mechanism based on the heterogeneous information of token pairs, coupled with forgery-aware reasoning and aggregation. Evaluated on the DGM4 benchmark, CSCL achieves significant improvements in grounding manipulated content, establishing a new state of the art (SOTA). The source code and pre-trained weights are publicly released.
📝 Abstract
To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM4. Two branches are established for the image and text modalities, each of which contains two cascaded decoders, i.e., a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. Specifically, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, forgery-aware reasoning or aggregation is applied to deeply mine forgery cues based on the consistency features. Extensive experiments on the DGM4 dataset show that CSCL achieves new state-of-the-art performance, especially for grounding manipulated content. Code and weights are available at https://github.com/liyih/CSCL.