MuSciClaims: Multimodal Scientific Claim Verification

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks a benchmark that directly evaluates multimodal scientific claim verification. This paper introduces MuSciClaims, the first multimodal claim-verification benchmark for scientific literature, built by automatically extracting supported claims and their associated figures from papers and manually applying fine-grained semantic perturbations to produce contradicted claims, thereby systematically assessing models' cross-modal reasoning and evidence-localization capabilities. Key contributions include: (1) the first multimodal scientific claim-verification benchmark released together with diagnostic tasks; (2) a controlled perturbation framework targeting evidence localization, cross-modal aggregation, and chart understanding; and (3) a suite of diagnostic tasks evaluated jointly on F1 and support bias. Experiments reveal that most current vision-language models perform poorly (F1 ≈ 0.3–0.5), with even the best model achieving only 0.77; all exhibit overconfidence in "support" predictions. Diagnostic analysis uncovers fundamental deficiencies in cross-modal alignment and in parsing basic chart elements.
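
The "joint F1/bias evaluation" above can be made concrete with a minimal sketch. This is not the paper's exact protocol; the label strings and the bias definition (fraction of predictions that say SUPPORT, where 0.5 is unbiased on a label-balanced benchmark) are illustrative assumptions.

```python
from collections import Counter

def f1_and_support_bias(gold, pred, positive="SUPPORT"):
    """Macro-F1 over SUPPORT/CONTRADICT plus a simple support-bias score.

    gold, pred: parallel lists of "SUPPORT"/"CONTRADICT" labels.
    The bias score is the fraction of predictions equal to `positive`.
    """
    f1s = []
    for label in ("SUPPORT", "CONTRADICT"):
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    macro_f1 = sum(f1s) / len(f1s)
    support_bias = Counter(pred)[positive] / len(pred)
    return macro_f1, support_bias

# Example: a model that over-predicts SUPPORT
gold = ["SUPPORT", "CONTRADICT", "CONTRADICT", "SUPPORT"]
pred = ["SUPPORT", "SUPPORT", "CONTRADICT", "SUPPORT"]
print(f1_and_support_bias(gold, pred))  # (~0.73, 0.75)
```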

📝 Abstract
Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostics tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.
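
To make the construction described in the abstract concrete, here is a hypothetical sketch of what a MuSciClaims-style instance might look like, with a toy perturbation that flips a comparative to turn a supported claim into a contradicted one. The field names are assumptions, and in the paper the perturbations are applied manually, so the automated flip below is purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClaimInstance:
    figure_path: str                     # figure the claim is verified against
    claim: str                           # natural-language claim
    label: str                           # "SUPPORT" or "CONTRADICT"
    perturbation: Optional[str] = None   # which capability the edit targets

# Toy perturbation: flip a comparative so the claim contradicts the figure.
# The paper's perturbations are fine-grained, manual semantic edits.
FLIPS = {"higher": "lower", "increases": "decreases", "above": "below"}

def contradict(supported: ClaimInstance) -> ClaimInstance:
    text = supported.claim
    for word, opposite in FLIPS.items():
        if word in text:
            text = text.replace(word, opposite, 1)
            break
    return ClaimInstance(supported.figure_path, text,
                         "CONTRADICT", perturbation="comparative-flip")

original = ClaimInstance("fig3b.png",
                         "Treatment A yields higher accuracy than B.",
                         "SUPPORT")
print(contradict(original).claim)
# Treatment A yields lower accuracy than B.
```
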
Problem

Research questions and friction points this paper is trying to address.

Lack of multimodal benchmarks for scientific claim verification
Need to verify claims using multimodal data from figures
Existing models perform poorly on nuanced claim perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically extract and perturb scientific claims
Introduce multimodal benchmark with diagnostics
Evaluate vision-language models on claim verification (prompt sketch below)
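
A hypothetical sketch of the zero-shot evaluation setting, reusing the ClaimInstance type from above. The prompt wording is illustrative, and `query_vlm` stands in for whatever vision-language model API is under test; neither is part of the paper.

```python
def build_prompt(claim: str) -> str:
    """Zero-shot verification prompt; wording is illustrative only."""
    return (
        "You are shown a figure from a scientific paper.\n"
        f"Claim: {claim}\n"
        "Does the figure SUPPORT or CONTRADICT the claim? "
        "Answer with exactly one word."
    )

def verify(instance, query_vlm) -> str:
    """query_vlm(image_path, prompt) -> str is a hypothetical interface."""
    answer = query_vlm(instance.figure_path, build_prompt(instance.claim))
    return "SUPPORT" if "support" in answer.lower() else "CONTRADICT"
```
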
🔎 Similar Papers
No similar papers found.