Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

📅 2024-11-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from Cognition and Perception (C&P) knowledge conflicts in document understanding: the answer a model gives for a document VQA task (cognition) can contradict the visual content it extracts via OCR (perception), hurting both performance and explainability. This work formally defines C&P knowledge conflicts as a form of multimodal knowledge conflict and systematically assesses them, finding that even GPT-4o, a leading MLLM, achieves only 68.6% C&P consistency. To mitigate the conflicts, the authors propose Multimodal Knowledge Consistency Fine-tuning, which first ensures task-specific consistency and then connects cognitive and perceptual knowledge. The method significantly reduces C&P knowledge conflicts across all tested MLLMs and improves their performance on both cognitive and perceptual tasks in most scenarios, pointing toward more trustworthy multimodal reasoning.

📝 Abstract
Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands." Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 68.6% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. This method first ensures task-specific consistency and then connects the cognitive and perceptual knowledge. Our method significantly reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks in most scenarios.
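The abstract frames C&P consistency as agreement between a model's VQA answer (cognition) and the text its OCR extracts from the document (perception). The paper does not publish its exact diagnostic here, so the sketch below is a hypothetical illustration: it uses a simple normalized substring-matching criterion and made-up helper names (`is_consistent`, `consistency_rate`) to show how such a consistency rate could be computed over answer/OCR pairs.

```python
def is_consistent(vqa_answer: str, ocr_tokens: list[str]) -> bool:
    """Hypothetical criterion: the answer counts as C&P-consistent if its
    normalized form appears verbatim in the OCR-extracted text."""
    norm = vqa_answer.strip().lower()
    ocr_text = " ".join(t.strip().lower() for t in ocr_tokens)
    return norm in ocr_text


def consistency_rate(predictions: list[tuple[str, list[str]]]) -> float:
    """Fraction of (answer, ocr_tokens) pairs judged consistent."""
    if not predictions:
        return 0.0
    hits = sum(is_consistent(ans, toks) for ans, toks in predictions)
    return hits / len(predictions)


# Toy examples, not from the paper's benchmarks:
preds = [
    ("$42.00", ["Total:", "$42.00"]),       # answer grounded in OCR output
    ("March 5", ["Date:", "2024-03-05"]),   # cognition disagrees with perception
]
print(consistency_rate(preds))  # 0.5
```

A real evaluation would need fuzzier matching (dates, numbers, paraphrases), but the structure mirrors the paper's framing: a model can answer correctly yet still be inconsistent with what it "sees."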
Problem

Research questions and friction points this paper is trying to address.

Assessing conflicts between cognition and perception in MLLMs
Mitigating multimodal knowledge conflicts in document understanding
Improving consistency between visual content and model understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines C&P knowledge conflicts in MLLMs
Proposes Multimodal Knowledge Consistency Fine-tuning
Improves C&P consistency and task performance