Robust Multimodal Large Language Models Against Modality Conflict

📅 2025-07-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work identifies intrinsic conflict between vision and language inputs in multimodal large language models (MLLMs) as a primary, previously underexplored cause of hallucination. To characterize this cross-modal inconsistency systematically, the authors introduce Multimodal Modality Conflict (MMMC), the first benchmark explicitly designed to simulate modality conflict. They then propose three mitigation methods based on prompt engineering, supervised fine-tuning (SFT), and reinforcement learning (RL). Extensive experiments on MMMC show that RL yields the largest reduction in hallucination rates, while SFT delivers stable, generalizable gains across diverse conflict scenarios. The study highlights the impact of modality conflict on MLLM robustness and contributes a reproducible evaluation framework together with effective intervention strategies, a step toward more trustworthy multimodal AI.

๐Ÿ“ Abstract
Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigates hallucinations in MLLMs caused by modality conflict
Defines modality conflict and constructs MMMC dataset for analysis
Proposes methods to mitigate hallucinations using prompt engineering, fine-tuning, and RL
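Of the three mitigation routes listed above, prompt engineering is the only one that needs no training. A minimal sketch of the idea, with wording that is purely illustrative and not taken from the paper: prepend an instruction telling the model how to behave when the image and the accompanying text disagree.

```python
# Hypothetical sketch of a prompt-engineering mitigation for modality
# conflict. The instruction text below is an illustrative assumption,
# not the prompt used in the paper.
def build_conflict_aware_prompt(question: str) -> str:
    instruction = (
        "The image and the text below may contradict each other. "
        "Ground your answer in the visual evidence, and state explicitly "
        "when the text conflicts with what the image shows."
    )
    return f"{instruction}\n\nQuestion: {question}"

prompt = build_conflict_aware_prompt("What color is the car?")
```

The SFT and RL methods instead change the model itself, which is why (per the summary) they trade higher cost for larger and more stable reductions in hallucination.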
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formally defines modality conflict in MLLMs
Constructs the MMMC dataset to simulate modality conflict in vision-language tasks
Shows that a reinforcement learning method mitigates conflict-induced hallucinations most effectively