Judge Before Answer: Can MLLM Discern the False Premise in Question?

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit poor performance in detecting false premises within image-text pairs, while existing benchmarks lack fine-grained categorization and comprehensive coverage. Method: We introduce the first systematic benchmark for false premise identification, hierarchically classifying false premises into three broad categories and thirteen fine-grained subtypes, and propose an automated, multimodal data construction pipeline. Furthermore, we design a dedicated enhancement framework integrating prompt learning and supervised fine-tuning to improve model robustness. Contribution/Results: Extensive experiments reveal that state-of-the-art MLLMs achieve low average accuracy (<40%) on this benchmark, confirming its difficulty and diagnostic value. Our enhancement framework yields substantial improvements, boosting accuracy by 12.7–28.3 percentage points. This work establishes a new paradigm and practical toolkit for evaluating and improving the logical reliability of MLLM reasoning.

📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable advances in recent years. Despite these successes, MLLMs remain vulnerable to false premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization and sufficient coverage, and thus fail to rigorously evaluate models' ability to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes premises into three main types and thirteen subtypes according to the abilities required to identify them, resulting in the JBA dataset. Results show that current MLLMs still struggle with false premise recognition. Building on this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs in detecting false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.
Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle to identify false premises embedded in questions
Existing benchmarks lack fine-grained categorization and comprehensive coverage
No automated pipeline exists for building benchmarks that evaluate and enhance false premise recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline constructs comprehensive false premise benchmark
Systematically categorizes premises into three main types and thirteen subtypes
Recognition enhancement framework strengthens MLLM false premise detection
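The paper's actual data schema and scoring protocol are not shown on this page. As a purely hypothetical sketch, the following illustrates how a false-premise benchmark instance might be structured and scored with a simple accuracy metric; the field names and category labels (`object-attribute`, `object-existence`) are assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class FalsePremiseItem:
    # Hypothetical schema: a question, the premise it embeds,
    # a gold label for whether that premise is false, and a
    # coarse category (the paper defines three broad categories).
    question: str
    premise: str
    premise_is_false: bool
    category: str

def accuracy(items, predictions):
    """Fraction of items where the model's false-premise
    judgment matches the gold label."""
    correct = sum(p == it.premise_is_false
                  for it, p in zip(items, predictions))
    return correct / len(items)

# Two toy instances (invented for illustration only).
items = [
    FalsePremiseItem("What color is the dog's hat?",
                     "the dog wears a hat", True, "object-attribute"),
    FalsePremiseItem("How many people are in the photo?",
                     "people are present", False, "object-existence"),
]
print(accuracy(items, [True, False]))  # 1.0
```

Under this framing, the paper's reported sub-40% average accuracy would mean state-of-the-art MLLMs mislabel the majority of such premise judgments.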
Jidong Li
School of Computer Science, Shanghai Jiao Tong University
Lingyong Fang
School of Computer Science, Shanghai Jiao Tong University
Haodong Zhao
Shanghai Jiao Tong University
Sufeng Duan
School of Computer Science, Shanghai Jiao Tong University
Gongshen Liu
School of Computer Science, Shanghai Jiao Tong University