Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In real-world, open-domain settings, multimodal large language models (MLLMs) frequently encounter implicitly flawed instructions, including references to missing objects, contradictions with visible facts, ambiguous references, and infeasible operations. Owing to a strong obedience bias, current models rarely detect such unstated logical inconsistencies on their own. To address this, we introduce the first benchmark comprising four categories of implicit reasoning diagnostics and use it to systematically evaluate the anomaly awareness of state-of-the-art MLLMs (e.g., GPT-4o, o3). Empirical results reveal uniformly low implicit-error detection rates (below 35%). We further find that MLLMs possess the latent reasoning capacity to catch these flaws, but that this capacity is suppressed by behavioral inertia. To mitigate this, we propose lightweight inference-time interventions, most notably requiring a clarifying question before execution, which raise average detection accuracy to 82%. This shows that MLLMs' critical-questioning ability can be activated and their response behavior reshaped, pointing to a practical path toward more trustworthy multimodal reasoning.

📝 Abstract
Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments where inputs are messy, underspecified, and not always trustworthy. Unlike curated benchmarks, these settings frequently involve instructions that refer to missing objects or contradictory facts, rely on ambiguous references, or request infeasible actions. In such cases, success hinges not on task execution alone, but on a model's ability to detect when something is silently wrong. This paper presents a systematic analysis of how current MLLMs handle such implicit reasoning scenarios: cases where the flaw is not explicitly stated but must be inferred from context. Using a curated diagnostic suite spanning four categories of real-world failure modes, we evaluate six MLLMs, including o3 and GPT-4o, and find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills. Explicit prompting reveals that the underlying capabilities exist but are often suppressed in favor of user compliance. We further show that simple inference-time interventions, such as cautious persona prompting and, in particular, requiring a clarifying question, can dramatically recover performance. Our findings highlight a persistent gap between reasoning competence and behavioral compliance in current MLLMs and suggest practical strategies for making these models more trustworthy in underconstrained environments.
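
The interventions described in the abstract operate purely at the prompt level. Below is a minimal sketch of what "cautious persona prompting" plus a mandatory clarifying question could look like in practice; the prompt wording, the query_mllm wrapper, and the model choice are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative sketch (not the paper's exact setup) of the two prompt-level
# interventions described above: a cautious persona and a mandatory
# clarifying question. `query_mllm`, the prompt wording, and the model
# choice are assumptions.

CAUTIOUS_PERSONA = (
    "You are a careful assistant. Before acting on an instruction about an "
    "image, check whether it refers to objects that are not present, "
    "contradicts what the image shows, is ambiguous, or asks for something "
    "infeasible."
)

REQUIRE_CLARIFICATION = (
    "If the instruction seems inconsistent with the image or underspecified, "
    "do not execute it; instead, ask exactly one clarifying question."
)


def query_mllm(system_prompt: str, instruction: str, image_path: str) -> str:
    """Placeholder for a call to any chat-style multimodal model (e.g., GPT-4o or o3)."""
    raise NotImplementedError("Plug in your MLLM client here.")


def answer_with_interventions(instruction: str, image_path: str) -> str:
    """Wrap the user instruction with both interventions before querying the model."""
    system_prompt = f"{CAUTIOUS_PERSONA}\n\n{REQUIRE_CLARIFICATION}"
    return query_mllm(system_prompt, instruction, image_path)
```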
Problem

Research questions and friction points this paper is trying to address.

Analyzing MLLMs' ability to detect implicit reasoning flaws
Evaluating models' failure to identify hidden issues in inputs
Addressing the gap between reasoning competence and behavioral compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing implicit reasoning in MLLMs
Using a curated diagnostic suite spanning four real-world failure-mode categories
Showing that simple inference-time interventions recover detection performance (see the sketch below)
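
As a rough illustration of how detection accuracy per diagnostic category might be tallied on such a benchmark: the category names, the keyword-based judge, and the reuse of answer_with_interventions() from the sketch above are assumptions, not the paper's actual scoring protocol.

```python
# Rough illustration of tallying per-category implicit-error detection rates.
# The category names, the keyword-based judge, and answer_with_interventions()
# (from the sketch above) are assumptions, not the paper's scoring protocol.
from collections import defaultdict

CATEGORIES = ("missing_object", "contradictory_fact",
              "ambiguous_reference", "infeasible_operation")


def flags_implicit_issue(response: str) -> bool:
    """Crude stand-in judge: does the response question or refuse the flawed instruction?"""
    cues = ("there is no", "cannot", "did you mean", "clarify", "which one")
    return any(cue in response.lower() for cue in cues)


def detection_rates(diagnostics):
    """diagnostics: iterable of (category, instruction, image_path) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, instruction, image_path in diagnostics:
        response = answer_with_interventions(instruction, image_path)
        totals[category] += 1
        hits[category] += int(flags_implicit_issue(response))
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}
```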
🔎 Similar Papers
No similar papers found.