Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly

📅 2024-06-15
📈 Citations: 5
Influential: 1
📄 PDF
🤖 AI Summary
This work identifies a critical deficiency in multimodal large language models (MLLMs): correct visual perception can co-occur with incorrect answers, particularly under indirect or misleading questions, and is accompanied by consistently low attention to visual tokens. To diagnose the issue, the authors introduce the first fine-grained benchmark of this failure mode, spanning 12 semantically distinct categories, and use attention-map and logits-distribution analysis to pinpoint root causes. They propose three interventions: (1) paired positive–negative sample construction to sharpen visual–linguistic alignment; (2) content-guided textual prompting to reinforce semantic fidelity; and (3) question-driven visual attention enhancement to prioritize task-relevant regions. Evaluated across 15 state-of-the-art MLLMs, the methods improve robustness against misleading queries, yield more rational visual attention allocation, and substantially reduce error rates.
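As a rough illustration of the paired positive–negative sample construction mentioned above, the sketch below builds a direct (positive) and a misleading (negative) question for the same image. The field names and question templates are hypothetical, not the paper's actual pipeline.

```python
def build_pair(image_id, present_object, absent_object):
    """Build a paired positive/negative QA sample for one image.

    Hypothetical schema: the positive question asks directly about
    content present in the image; the negative question is a misleading
    variant about absent content, so the correct answer flips to 'No'.
    """
    positive = {
        "image": image_id,
        "question": f"Is there a {present_object} in the image?",
        "answer": "Yes",
    }
    negative = {
        "image": image_id,
        "question": f"Is there a {absent_object} in the image?",
        "answer": "No",
    }
    return positive, negative
```

Pairing both questions on the same image pushes the model to ground its answer in the visual content rather than in the question's phrasing.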

📝 Abstract
Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multi-modal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To this end, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that 'directly' relate to the visual content, leading to a bias in MLLMs' responses to other indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We further observe that attention scores between questions and visual tokens as well as the model's confidence in the answers are lower in response to misleading questions than to straightforward ones. To address the first challenge, we introduce a paired positive and negative data construction pipeline to diversify the dataset. For the second challenge, we propose to enhance the model's focus on visual content during decoding by refining the text and visual prompt. For the text prompt, we propose a content guided refinement strategy that performs preliminary visual content analysis to generate structured information before answering the question. Additionally, we employ a visual attention refinement strategy that highlights question-relevant visual tokens to increase the model's attention to visual content that aligns with the question. Extensive experiments demonstrate that these challenges can be significantly mitigated with our proposed dataset and techniques.
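The visual attention refinement described in the abstract can be sketched as reweighting pre-softmax attention scores so that question-relevant visual tokens receive more mass. The additive boost and the `alpha` hyperparameter below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def refine_visual_attention(attn_logits, visual_idx, relevance, alpha=1.5):
    """Upweight question-relevant visual tokens before the softmax.

    attn_logits: (seq_len,) pre-softmax attention scores for one query position
    visual_idx:  indices of the visual tokens within the sequence
    relevance:   (len(visual_idx),) question-to-visual-token relevance scores
    alpha:       boost strength (hypothetical hyperparameter)
    """
    refined = np.asarray(attn_logits, dtype=float).copy()
    refined[visual_idx] += alpha * np.asarray(relevance, dtype=float)
    # Numerically stable softmax over the refined scores
    weights = np.exp(refined - refined.max())
    return weights / weights.sum()
```

Because the boost is applied before normalization, attention mass shifts toward relevant visual tokens and away from system and question tokens, which the paper identifies as receiving disproportionately high attention.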
Problem

Research questions and friction points this paper is trying to address.

MLLMs generate incorrect answers despite understanding visual content.
Instruction tuning datasets bias MLLMs towards direct visual questions.
MLLMs exhibit low attention to visual tokens compared to text tokens.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Paired positive–negative data construction pipeline
Content-guided text prompt refinement strategy
Question-relevant visual attention refinement strategy for decoding
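The content-guided text prompt refinement can be sketched as a two-stage prompt: first elicit a structured description of the visual content, then answer the question conditioned on that description. Here `describe_fn` stands in for a call to the MLLM, and the prompt wording is illustrative.

```python
def content_guided_prompt(question, describe_fn):
    """Two-stage prompting: analyze visual content first, then answer.

    describe_fn: hypothetical callable that queries the MLLM about the image
    """
    # Stage 1: preliminary visual content analysis -> structured information
    analysis = describe_fn(
        "List the key objects, their attributes, and their relations in the image."
    )
    # Stage 2: prepend the analysis so the answer stays grounded in it
    return (
        f"Image analysis:\n{analysis}\n\n"
        f"Using the analysis above and the image, answer: {question}"
    )
```

Generating the structured description before answering reduces the model's tendency to answer from linguistic priors when the question is indirect or misleading.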