Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study investigates why omnimodal large language models fail when a question's textual premise conflicts with what the model actually sees or hears. The authors introduce IMAVB, a curated 500-clip benchmark drawn from long-form movies that pairs standard and misleading premises across vision and audio, and analyze models through behavioral evaluation, hidden-state probing, seven prompt variants, and a Probe-Guided Logit Adjustment (PGLA) method. The work documents a pervasive “representation–action gap”: hidden states encode the premise-perception mismatch, yet model outputs rarely reject the false premise. The gap is modality-asymmetric (audio grounding underperforms vision) and persists across all seven prompt variants. Experiments across nine models (eight open-source omnimodal LLMs and Gemini 3.1 Pro) indicate that PGLA improves rejection of false premises without compromising general comprehension performance.
📝 Abstract
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
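The abstract does not say how PGLA is implemented, so the following is only a minimal sketch of the general idea, under stated assumptions: a linear probe reads a conflict probability from a hidden state, and that probability is used to boost the logits of rejection tokens at decoding time. The function names, the probe form, the `alpha` and `threshold` parameters, and the toy values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def probe_score(hidden_state, w, b):
    """Assumed linear probe: sigmoid(w . h + b) as a conflict probability."""
    z = float(hidden_state @ w + b)
    return 1.0 / (1.0 + np.exp(-z))

def adjust_logits(logits, reject_token_ids, p_conflict, alpha=5.0, threshold=0.5):
    """Additively boost rejection-token logits when the probe signals a conflict.

    The boost grows with how far p_conflict exceeds the threshold and is
    zero at or below it, so unflagged inputs are decoded unchanged.
    """
    adjusted = np.asarray(logits, dtype=float).copy()
    boost = alpha * max(0.0, p_conflict - threshold)
    adjusted[reject_token_ids] += boost
    return adjusted

# Toy example with made-up numbers (not data from the paper).
hidden = np.full(16, 0.5)                 # stand-in for a model hidden state
w, b = np.full(16, 0.5), 0.0              # stand-in for trained probe weights
logits = np.array([2.0, 1.0, 0.5, 0.0])   # index 3 is a hypothetical "reject" token

p = probe_score(hidden, w, b)
new_logits = adjust_logits(logits, [3], p)
```

In this toy setup the probe is confident about a conflict, so the rejection token overtakes the originally preferred token; when the probe score stays at or below the threshold, the logits are returned unchanged, which is one way such an adjustment could avoid disturbing standard questions.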
Problem

Research questions and friction points this paper is trying to address.

omnimodal LLMs
perception-action gap
premise-perception conflict
grounding
multimodal comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation-Action Gap
omnimodal LLMs
conflict detection
PGLA
IMAVB benchmark