M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks fall short in scale, domain coverage, and visual complexity, limiting their ability to effectively evaluate the consistency between scientific claims and their multimodal evidence. This work proposes M2-Verify—the first large-scale, cross-disciplinary benchmark for multimodal consistency verification with high visual complexity—spanning 16 scientific domains and comprising 469,000 expert-validated instances derived from PubMed and arXiv. Systematic experiments reveal that state-of-the-art models achieve a Micro-F1 score of 85.8% on low-complexity medical tasks but experience a sharp decline to 61.6% in high-complexity scenarios, such as those involving anatomical structural changes. Furthermore, these models exhibit substantial hallucination when generating explanations, underscoring critical limitations in current multimodal reasoning capabilities.
📝 Abstract
Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
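Both the summary and the abstract report results as Micro-F1, which pools true positives, false positives, and false negatives across all classes before computing a single F1 score. A minimal sketch of that computation is below; the label names in the usage example are illustrative, not the paper's actual label set.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: aggregate TP/FP/FN over all classes,
    then compute one precision/recall/F1 from the global counts."""
    labels = set(gold) | set(pred)
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1          # predicted this class, correctly
            elif p == label:
                fp += 1          # predicted this class, wrongly
            elif g == label:
                fn += 1          # missed this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical verdict labels for illustration only:
score = micro_f1(
    ["consistent", "inconsistent", "consistent", "nei"],
    ["consistent", "consistent", "consistent", "nei"],
)  # → 0.75
```

Note that for single-label multiclass prediction (one verdict per instance), Micro-F1 reduces to plain accuracy, since every false positive for one class is a false negative for another.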
Problem

Research questions and friction points this paper addresses.

multimodal claim consistency
scientific argument evaluation
benchmark dataset
visual complexity
domain diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal claim verification
scientific consistency
large-scale benchmark
visual complexity
expert-audited dataset