Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

📅 2024-03-29

🤖 AI Summary

Current evaluations of large multimodal models (LMMs) rarely test robust understanding, in particular whether a model can recognize an unsolvable multimodal question and withhold an answer. Method: This work introduces Unsolvable Problem Detection (UPD) as a new evaluation task covering three unsolvable scenarios in multiple-choice question answering: Absent Answer Detection (AAD), where the correct answer is missing from the options; Incompatible Answer Set Detection (IASD), where the options are incompatible with the question; and Incompatible Visual Question Detection (IVQD), where the image and the question do not match. It releases MM-UPD Bench, a benchmark for measuring UPD performance across ability dimensions, and examines chain-of-thought (CoT) reasoning and self-reflection as ways to improve that performance. Contribution/Results: Most LMMs that perform adequately on existing benchmarks struggle significantly on MM-UPD, exposing an aspect of trustworthiness that current benchmarks overlook. Different LMMs have different bottlenecks, and CoT and self-reflection help the models whose bottleneck lies in their LLM capability. The work points to concrete directions for building more reliable LMMs.

📝 Abstract

This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed $\textbf{Unsolvable Problem Detection (UPD)}$. Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM's ability to withhold answers when encountering unsolvable MCQA problems, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases such as a missing correct answer, incompatible choices, and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, even those that demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks, and that chain-of-thought and self-reflection improved performance for LMMs whose bottleneck lies in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs.
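
To make the three UPD settings above concrete, the following Python sketch shows one plausible way to derive unsolvable variants from a standard multiple-choice item and to score a model that may withhold its answer. It is an illustration under stated assumptions, not the MM-UPD Bench construction procedure: the `MCQItem` type, the `make_aad`, `make_iasd`, and `make_ivqd` helpers, and the convention that a withheld answer is represented by `None` are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical container for one multiple-choice VQA item."""
    image_id: str
    question: str
    options: List[str]
    answer: Optional[int]  # index of the correct option; None if unsolvable


def make_aad(item: MCQItem) -> MCQItem:
    """Absent Answer Detection: drop the correct option so no choice is right."""
    options = [o for i, o in enumerate(item.options) if i != item.answer]
    return replace(item, options=options, answer=None)


def make_iasd(item: MCQItem, unrelated: MCQItem) -> MCQItem:
    """Incompatible Answer Set Detection: use options from an unrelated question."""
    return replace(item, options=list(unrelated.options), answer=None)


def make_ivqd(item: MCQItem, unrelated: MCQItem) -> MCQItem:
    """Incompatible Visual Question Detection: pair the question with an unrelated image."""
    return replace(item, image_id=unrelated.image_id, answer=None)


def is_correct(item: MCQItem, predicted: Optional[int]) -> bool:
    """Score one prediction; `predicted` is None when the model withholds an answer."""
    return predicted == item.answer


# Assumed example data, for illustration only.
standard = MCQItem("img_cat", "What animal is shown?", ["cat", "dog", "bird"], 0)
other = MCQItem("img_car", "What color is the car?", ["red", "blue", "green"], 1)

aad_item = make_aad(standard)           # options: ["dog", "bird"]
iasd_item = make_iasd(standard, other)  # options: ["red", "blue", "green"]
ivqd_item = make_ivqd(standard, other)  # image: "img_car"
```

Under this convention, a model that picks any listed option on an unsolvable variant is scored as wrong, while one that withholds its answer is scored as right. On the standard item the situation is reversed, so a model cannot score well simply by always abstaining.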

Problem

Research questions and friction points this paper is trying to address.

Evaluating robust understanding in Large Multimodal Models via Unsolvable Problem Detection

Assessing model ability to withhold answers for unsolvable multiple-choice questions

Identifying bottlenecks in LMMs using MM-UPD Bench for trustworthiness improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Unsolvable Problem Detection (UPD) task

Develops MM-UPD Bench for robust evaluation

Uses chain-of-thought and self-reflection techniques
