PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Subtle inconsistencies among the text, figures, and formulas of scientific papers undermine reproducibility and credibility, yet existing benchmarks inadequately evaluate models' ability to detect and reason about such authentic multimodal contradictions. Method: PRISMM-Bench is introduced as the first multimodal inconsistency benchmark grounded in real peer-review annotations, comprising 262 cross-modal contradiction instances extracted from actual scientific publications. Its construction employs a three-stage pipeline of peer-review text mining, LLM-assisted filtering, and rigorous human validation, paired with structured JSON answer representations designed to suppress multiple-choice shortcut learning and linguistic bias. Contribution/Results: Evaluation of 21 state-of-the-art large multimodal models (LMMs) reveals uniformly low performance (26.1%–54.2%), exposing fundamental limitations of current LMMs in scientific-grade multimodal reasoning. PRISMM-Bench thus provides critical evaluation infrastructure for developing trustworthy AI research assistants.

📝 Abstract
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking multimodal models' ability to detect scientific inconsistencies
Overcoming limitations of existing benchmarks that rely on synthetic errors
Evaluating models' capacity to resolve cross-modal reasoning challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created benchmark using real reviewer-flagged scientific inconsistencies
Introduced structured JSON answers to minimize linguistic biases
Designed three tasks for detecting and correcting multimodal inconsistencies
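To make the structured-answer idea concrete: instead of letting a model pick an answer letter (which invites choice-only shortcuts), each response is emitted as a JSON object whose fields can be compared on content. The page does not reproduce the benchmark's actual schema, so the field names below (`inconsistency_type`, `location_a`, `location_b`, `explanation`) are illustrative assumptions, not PRISMM-Bench's real format.

```python
import json

# Hypothetical structured answer for an inconsistency-identification item.
# All field names are illustrative assumptions; the benchmark's real schema
# is not shown on this page.
answer = {
    "inconsistency_type": "figure_text_mismatch",
    "location_a": {"modality": "text", "section": "4.2"},
    "location_b": {"modality": "figure", "id": "Figure 3"},
    "explanation": "Accuracy reported in the text disagrees with the plotted value.",
}

def is_valid_answer(obj: dict) -> bool:
    """Check that a candidate answer carries the required structured fields,
    so evaluation can score content rather than answer-letter patterns."""
    required = {"inconsistency_type", "location_a", "location_b", "explanation"}
    return required.issubset(obj) and all(
        "modality" in obj[k] for k in ("location_a", "location_b")
    )

# Round-trip through JSON, as an evaluation harness would.
parsed = json.loads(json.dumps(answer))
print(is_valid_answer(parsed))
```

Because the answer is a machine-checkable object rather than free text or a letter choice, superficial stylistic cues (answer length, phrasing) carry no signal at scoring time.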
Lukas Selch
Johannes Kepler University Linz
Yufang Hou
Interdisciplinary Transformation University Austria
M. Jehanzeb Mirza
MIT CSAIL
Sivan Doveh
Weizmann Institute of Science; Google
James Glass
MIT Computer Science and Artificial Intelligence Laboratory
Speech and Language Processing
Rogerio Feris
Research Manager, MIT-IBM Watson AI Lab
Computer Vision · Machine Learning · Artificial Intelligence
Wei Lin
Johannes Kepler University Linz