ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification

📅 2025-08-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of verifying fine-grained cross-modal consistency—specifically in narrative logic, affective valence, and scene-event alignment—between visual and textual elements in digital news. To this end, we propose the first Fine-grained Cross-modal Contextual Consistency (FCCC) detection framework. Methodologically, it employs a multi-stage contextual reasoning mechanism and introduces three annotation dimensions: affective polarity, visual narrative theme, and event-logical coherence, formalized via a novel CTXT entity type. Leveraging vision-language foundation models, the framework integrates reinforcement learning and adversarial training to enhance sensitivity to latent inconsistencies. Experimental results demonstrate that our approach significantly outperforms zero-shot baselines across multiple augmented datasets. It achieves breakthroughs in logical reasoning capability, robustness against input perturbations, and alignment with human expert judgments—establishing a new state of the art in fine-grained multimodal consistency assessment.

Technology Category

Application Category

📝 Abstract
The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly regarding the consistency between visual and textual information. Traditional approaches often fall short in addressing the fine-grained cross-modal contextual consistency (FCCC) problem, which encompasses deeper alignment of visual narrative, emotional tone, and background information with text, beyond mere entity matching. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models (LVLMs) and integrating a multi-stage contextual reasoning mechanism. Our model is uniquely enhanced through reinforced or adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations, including "contextual sentiment," "visual narrative theme," and "scene-event logical coherence," and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, showing significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.
Problem

Research questions and friction points this paper is trying to address.

Detecting fine-grained cross-modal inconsistencies in news content
Improving contextual alignment between visual and textual information
Enhancing robustness against subtle perturbations in news verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained cross-modal contextual consistency verification
Multi-stage contextual reasoning mechanism
Reinforced or adversarial learning paradigms
🔎 Similar Papers
No similar papers found.
Sihan Ma
Sihan Ma
University of Sydney
Deep LearningComputer VisionRobotics
Q
Qiming Wu
Inner Mongolia University of Science & Technology
R
Ruotong Jiang
Inner Mongolia University of Science & Technology
F
Frank Burns
Federal University of Rio de Janeiro