DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that scientific vision-language models (VLMs) often rely on linguistic priors, obscuring their genuine visual reasoning capabilities. To enable fine-grained diagnosis, the authors introduce DISSECT, a diagnostic benchmark built around a novel Model Oracle protocol. It systematically evaluates 18 VLMs across five input conditions (Vision+Text, Text-Only, Vision-Only, Human Oracle, and Model Oracle) on a dataset of 12,000 chemistry and biology questions, a design that disentangles visual perception from downstream reasoning. Experiments reveal that chemistry tasks are less amenable to linguistic shortcuts than biology tasks, and that open-source models perform better when reasoning over their own generated image descriptions than over the raw images, exposing a bottleneck in perception-reasoning integration. Closed-source models show no such gap, indicating stronger multimodal fusion.
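A minimal sketch of how the five input conditions might be driven in code, assuming a single `ask(image, text)` callable that wraps the VLM under test; the function names, prompt wording, and the exact form of the Vision-Only condition are illustrative assumptions, not the authors' actual harness:

```python
# Hypothetical sketch of DISSECT's five input conditions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Question:
    text: str                # the question (with answer options)
    image: bytes             # the scientific diagram
    human_description: str   # expert-written description of the image

def evaluate_all_conditions(q: Question,
                            ask: Callable[[Optional[bytes], str], str]) -> dict:
    """Return the model's answer under each of the five DISSECT input modes."""
    # Model Oracle: the VLM first verbalizes the image, then reasons
    # text-only over its own description.
    own_description = ask(q.image, "Describe this image in detail.")
    return {
        "vision+text":  ask(q.image, q.text),
        "text-only":    ask(None, q.text),
        "vision-only":  ask(q.image, ""),  # question text withheld (assumed form)
        "human-oracle": ask(None, q.human_description + "\n" + q.text),
        "model-oracle": ask(None, own_description + "\n" + q.text),
    }
```

Because the Model Oracle stage only needs the model's own description and a text-only query, it can be bolted onto an existing evaluation loop without changing the underlying benchmark, which is what makes the protocol benchmark-agnostic.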
📝 Abstract
When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see, but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description), yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model- and benchmark-agnostic, applicable post hoc to any VLM evaluation to diagnose integration failures.
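The abstract's gap decomposition can be read as simple accuracy differences between conditions. The sketch below uses assumed definitions for illustration (e.g., the integration gap as Model Oracle accuracy minus Vision+Text accuracy); the paper's exact formulas may differ:

```python
# Illustrative gap definitions; each gap is assumed to be a plain
# accuracy difference between input conditions.
def diagnostic_gaps(acc: dict) -> dict:
    return {
        # Score achievable with no image at all: language-prior exploitation.
        "language_prior": acc["text-only"],
        # Lift from adding the image on top of the text: visual extraction.
        "visual_extraction": acc["vision+text"] - acc["text-only"],
        # Model's own description vs. a human-written one: perception fidelity.
        "perception_fidelity": acc["model-oracle"] - acc["human-oracle"],
        # Own description vs. raw image: a positive value means the model
        # reasons better over its own words, i.e. an integration bottleneck.
        "integration_gap": acc["model-oracle"] - acc["vision+text"],
    }

# Hypothetical numbers for an open-source model with a positive integration gap:
print(diagnostic_gaps({"text-only": 0.41, "vision+text": 0.55,
                       "model-oracle": 0.62, "human-oracle": 0.74}))
```

Under this reading, the paper's open-source finding corresponds to a positive integration gap, while closed-source models sit at or near zero.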
Problem

Research questions and friction points this paper is trying to address.

perception-integration gap
vision-language models
visual reasoning
scientific diagrams
diagnostic benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

perception-integration gap
Model Oracle
visual reasoning diagnosis
language priors
scientific VLMs
👥 Authors
Dikshant Kukreja, IIIT Delhi, India
Kshitij Sah, IIIT Delhi, India
Karan Goyal, IIIT Delhi, India
Mukesh Mohania, IIIT Delhi (Databases, AI for Data, Trusted/Autonomous Information Systems)
Vikram Goyal, Professor (Big Data Analytics, Data Mining, Machine Learning)