SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

📅 2026-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods struggle to assess whether multimodal large language models (MLLMs) can reason causally over long scientific documents while grounding that reasoning in authentic chains of evidence. To address this, the work proposes the “Fish-in-the-Ocean” paradigm, introducing the SIN-Data corpus, which preserves the native interleaved text–image structure of scientific literature, and the SIN-Bench benchmark, comprising four progressive tasks: evidence discovery, hypothesis verification, evidence-grounded question answering, and evidence-anchored summarization. A novel “No Evidence, No Score” mechanism evaluates cross-modal evidence chains along three dimensions: matchability, relevance, and logical coherence. Experiments reveal a pervasive evidence-grounding bottleneck among leading models: Gemini-3-Pro achieves the highest overall score (0.573), while GPT-5, despite attaining an answer accuracy of 0.767, exhibits notably low evidence-alignment scores, highlighting a significant gap between correctness and traceable evidential support.

📝 Abstract
Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", which scores predictions only when they are grounded to verifiable anchors and diagnoses evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
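To make the "No Evidence, No Score" idea concrete, the following is a minimal Python sketch of such a gated scorer, based only on the abstract's description. All names here (Prediction, ne_ns_score, the evidence-quality weighting) are hypothetical illustrations, not the paper's actual implementation; the paper's real anchor format and aggregation may differ.

```python
# Hypothetical sketch of a "No Evidence, No Score" gate: an answer is
# scored only if the model cites at least one verifiable gold anchor,
# and evidence quality is diagnosed along matching, relevance, and logic.
from dataclasses import dataclass, field


@dataclass
class Prediction:
    answer_score: float                                    # answer correctness in [0, 1]
    cited_anchors: set[str] = field(default_factory=set)  # evidence IDs the model cited
    relevance: float = 0.0                                 # judged relevance of evidence, [0, 1]
    logic: float = 0.0                                     # judged logical coherence, [0, 1]


def ne_ns_score(pred: Prediction, gold_anchors: set[str]) -> dict[str, float]:
    """Return the three diagnostic dimensions plus a gated overall score.

    If no cited anchor matches a gold anchor, the overall score is 0
    regardless of answer correctness ("No Evidence, No Score").
    """
    matched = pred.cited_anchors & gold_anchors
    matching = len(matched) / len(gold_anchors) if gold_anchors else 0.0
    if not matched:  # the gate: ungrounded answers earn nothing
        return {"matching": 0.0, "relevance": 0.0, "logic": 0.0, "overall": 0.0}
    # Hypothetical aggregation: evidence quality scales the answer score.
    evidence_quality = (matching + pred.relevance + pred.logic) / 3
    return {
        "matching": matching,
        "relevance": pred.relevance,
        "logic": pred.logic,
        "overall": pred.answer_score * evidence_quality,
    }


if __name__ == "__main__":
    gold = {"fig3", "sec4.2"}
    # A correct answer with no cited evidence earns nothing under the gate.
    print(ne_ns_score(Prediction(answer_score=1.0), gold))
    # The same answer grounded to one gold anchor is scored and diagnosed.
    print(ne_ns_score(Prediction(answer_score=1.0, cited_anchors={"fig3"},
                                 relevance=0.9, logic=0.8), gold))
```

Under a gate like this, a model such as GPT-5 can combine high answer accuracy with a low overall score whenever its answers are not traceably grounded, which is exactly the correctness-versus-support gap the benchmark is designed to expose.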
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
long-context scientific understanding
evidence-based reasoning
scientific document evaluation
cross-modal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fish-in-the-Ocean
evidence chain
multimodal scientific reasoning
long-context grounding
SIN-Bench