🤖 AI Summary
This study addresses the challenge of verifying fine-grained cross-modal consistency—specifically in narrative logic, affective valence, and scene-event alignment—between visual and textual elements in digital news. To this end, we propose the first Fine-grained Cross-modal Contextual Consistency (FCCC) detection framework. Methodologically, it employs a multi-stage contextual reasoning mechanism and introduces three annotation dimensions—affective polarity, visual narrative theme, and event-logical coherence—formalized via a novel CTXT entity type. Building on large vision-language foundation models, the framework integrates reinforcement learning and adversarial training to sharpen sensitivity to latent inconsistencies. Experiments show that the approach significantly outperforms zero-shot baselines across multiple augmented datasets, with clear gains in complex logical reasoning, robustness to input perturbations, and agreement with human expert judgments—establishing a new state of the art in fine-grained multimodal consistency assessment.
📝 Abstract
The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly the consistency between visual and textual information. Traditional approaches often fall short on the fine-grained cross-modal contextual consistency (FCCC) problem, which goes beyond mere entity matching to demand deeper alignment of visual narrative, emotional tone, and background information with the accompanying text. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Large Vision-Language Models (LVLMs) that integrates a multi-stage contextual reasoning mechanism. Our model is further enhanced through reinforcement and adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations—"contextual sentiment," "visual narrative theme," and "scene-event logical coherence"—and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, with significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.
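To make the CTXT entity type concrete, the sketch below shows one plausible way the three annotation dimensions (contextual sentiment, visual narrative theme, scene-event logical coherence) could be grouped and aggregated into a per-sample consistency verdict. This is purely illustrative: the class and function names, the 0–1 score scale, and the threshold are our assumptions, not the authors' implementation, and the upstream LVLM scoring is elided entirely.

```python
from dataclasses import dataclass

# Hypothetical container mirroring the paper's CTXT entity type.
# Field names and the 0..1 score scale are assumptions for illustration.
@dataclass
class CTXTScores:
    contextual_sentiment: float      # affective-polarity alignment with the text
    visual_narrative_theme: float    # alignment of the image's narrative theme
    scene_event_coherence: float     # scene-event logical coherence

def fccc_verdict(scores: CTXTScores, threshold: float = 0.5) -> dict:
    """Flag each fine-grained dimension, then aggregate: the sample counts
    as consistent only if every dimension clears the (assumed) threshold."""
    flags = {
        "contextual_sentiment": scores.contextual_sentiment >= threshold,
        "visual_narrative_theme": scores.visual_narrative_theme >= threshold,
        "scene_event_coherence": scores.scene_event_coherence >= threshold,
    }
    flags["consistent"] = all(flags.values())
    return flags

# Example: sentiment and theme align, but the depicted scene contradicts
# the reported event, so the sample is flagged as inconsistent overall.
verdict = fccc_verdict(CTXTScores(0.9, 0.8, 0.2))
```

Requiring all dimensions to pass (rather than averaging) reflects the paper's framing that a single subtle misalignment, such as a mismatched emotional tone, is enough to constitute context detachment.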