DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that scientific vision-language models (VLMs) often rely on linguistic priors, obscuring their genuine visual reasoning capabilities. To enable fine-grained diagnosis, the authors introduce DISSECT, a diagnostic benchmark built around a novel Model Oracle protocol. It systematically evaluates 18 VLMs across five input conditions (Vision+Text, Text-Only, Vision-Only, Human Oracle, and Model Oracle) on a dataset of 12,000 chemistry and biology questions, a design that disentangles visual perception from downstream reasoning. Experiments reveal that chemistry tasks are less amenable to linguistic shortcuts than biology tasks, and that open-source models perform better when reasoning over their own generated image descriptions than over the raw images, exposing a bottleneck in perception-reasoning integration. Closed-source models show no such gap, indicating stronger multimodal fusion.
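A minimal sketch of how the five input conditions might be driven in code, assuming a single `ask(image, text)` callable that wraps the VLM under test; the function names, prompt wording, and the exact form of the Vision-Only condition are illustrative assumptions, not the authors' actual harness:

```python
# Hypothetical sketch of DISSECT's five input conditions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Question:
    text: str                # the question (with answer options)
    image: bytes             # the scientific diagram
    human_description: str   # expert-written description of the image

def evaluate_all_conditions(q: Question,
                            ask: Callable[[Optional[bytes], str], str]) -> dict:
    """Return the model's answer under each of the five DISSECT input modes."""
    # Model Oracle: the VLM first verbalizes the image, then reasons
    # text-only over its own description.
    own_description = ask(q.image, "Describe this image in detail.")
    return {
        "vision+text":  ask(q.image, q.text),
        "text-only":    ask(None, q.text),
        "vision-only":  ask(q.image, ""),  # question text withheld (assumed form)
        "human-oracle": ask(None, q.human_description + "\n" + q.text),
        "model-oracle": ask(None, own_description + "\n" + q.text),
    }
```

Because the Model Oracle stage only needs the model's own description and a text-only query, it can be bolted onto an existing evaluation loop without changing the underlying benchmark, which is what makes the protocol benchmark-agnostic.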
📝 Abstract
When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see, but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description), yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model- and benchmark-agnostic, applicable post hoc to any VLM evaluation to diagnose integration failures.
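The abstract's gap decomposition can be read as simple accuracy differences between conditions. The sketch below uses assumed definitions for illustration (e.g., the integration gap as Model Oracle accuracy minus Vision+Text accuracy); the paper's exact formulas may differ:

```python
# Illustrative gap definitions; each gap is assumed to be a plain
# accuracy difference between input conditions.
def diagnostic_gaps(acc: dict) -> dict:
    return {
        # Score achievable with no image at all: language-prior exploitation.
        "language_prior": acc["text-only"],
        # Lift from adding the image on top of the text: visual extraction.
        "visual_extraction": acc["vision+text"] - acc["text-only"],
        # Model's own description vs. a human-written one: perception fidelity.
        "perception_fidelity": acc["model-oracle"] - acc["human-oracle"],
        # Own description vs. raw image: a positive value means the model
        # reasons better over its own words, i.e. an integration bottleneck.
        "integration_gap": acc["model-oracle"] - acc["vision+text"],
    }

# Hypothetical numbers for an open-source model with a positive integration gap:
print(diagnostic_gaps({"text-only": 0.41, "vision+text": 0.55,
                       "model-oracle": 0.62, "human-oracle": 0.74}))
```

Under this reading, the paper's open-source finding corresponds to a positive integration gap, while closed-source models sit at or near zero.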
Problem

Research questions and friction points this paper is trying to address.

perception-integration gap
vision-language models
visual reasoning
scientific diagrams
diagnostic benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

perception-integration gap
Model Oracle
visual reasoning diagnosis
language priors
scientific VLMs
👥 Authors
Dikshant Kukreja, IIIT Delhi, India
Kshitij Sah, IIIT Delhi, India
Karan Goyal, IIIT Delhi, India
Mukesh Mohania, IIIT Delhi (Databases, AI for Data, Trusted/Autonomous Information Systems)
Vikram Goyal, Professor (Big Data Analytics, Data Mining, Machine Learning)