THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess the ability of multimodal large language models (MLLMs) to reason about image-based fraud in authentic academic contexts. To address this gap, this work introduces a multitask benchmark comprising over 4,000 questions, integrating real retracted papers and synthetically generated data across seven academic scenarios, five categories of scientific fraud, and 16 fine-grained image manipulation techniques. Crucially, it establishes the first mapping between fraud types and five core reasoning capabilities, enabling multidimensional and fine-grained evaluation of MLLMs’ visual fraud reasoning. Experiments on 16 leading models reveal significant limitations: even the top-performing GPT-5 achieves only 56.15% accuracy, underscoring the substantial challenges current models face in handling such complex, real-world tasks.

📝 Abstract
We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.
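The fraud-type-to-capability mapping described in the abstract can be illustrated with a minimal sketch of per-capability scoring. This is not the authors' evaluation code: the abstract does not name the five fraud types or the five capabilities, so the labels below (`fraud_type_A`, `capability_1`, and so on), the function name `capability_accuracy`, and the sample results are all hypothetical placeholders. The sketch assumes each question carries one fraud-type label and a correct/incorrect outcome, and that a fraud type may exercise more than one capability.

```python
from collections import defaultdict

# Hypothetical mapping: the real fraud types and capabilities are not
# listed in the abstract, so these names are placeholders only.
FRAUD_TO_CAPABILITIES = {
    "fraud_type_A": ["capability_1", "capability_2"],
    "fraud_type_B": ["capability_2"],
}


def capability_accuracy(results, mapping):
    """Aggregate per-question outcomes into per-capability accuracy.

    results: iterable of (fraud_type, is_correct) pairs.
    mapping: dict from fraud type to the capabilities it exercises.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for fraud_type, is_correct in results:
        # A question counts toward every capability its fraud type maps to.
        for capability in mapping[fraud_type]:
            total[capability] += 1
            correct[capability] += int(is_correct)
    return {cap: correct[cap] / total[cap] for cap in total}


# Made-up outcomes for illustration, not data from the paper.
results = [
    ("fraud_type_A", True),
    ("fraud_type_A", False),
    ("fraud_type_B", True),
]
scores = capability_accuracy(results, FRAUD_TO_CAPABILITIES)
for capability, accuracy in sorted(scores.items()):
    print(f"{capability}: {accuracy:.2%}")
```

Reporting one score per capability rather than a single overall number is what lets a benchmark of this kind separate a model's strengths from its weaknesses; the paper may weight or aggregate questions differently.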
Problem

Research questions and friction points this paper is trying to address.

scientific paper fraud
multimodal large language models
visual fraud reasoning
real-world scenarios
fraud forensics
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language models
scientific fraud detection
visual reasoning benchmark
image manipulation analysis
holistic evaluation
Tzu-Yen Ma
Beijing University of Posts and Telecommunications
Bo Zhang
Beijing University of Posts and Telecommunications
Zichen Tang
Beijing University of Posts and Telecommunications
Junpeng Ding
Beijing University of Posts and Telecommunications
Haolin Tian
Beijing University of Posts and Telecommunications
Yuanze Li
Beijing University of Posts and Telecommunications
Zhuodi Hao
Beijing University of Posts and Telecommunications
Zixin Ding
Beijing University of Posts and Telecommunications
Zirui Wang
Beijing University of Posts and Telecommunications
Xinyu Yu
Beijing University of Posts and Telecommunications
Shiyao Peng
Beijing University of Posts and Telecommunications
Yizhuo Zhao
Beijing University of Posts and Telecommunications
Ruomeng Jiang
Beijing University of Posts and Telecommunications
Yiling Huang
PhD Student in Statistics, University of Michigan
Selective Inference
Peizhi Zhao
Beijing University of Posts and Telecommunications
Jiayuan Chen
Beijing University of Posts and Telecommunications
Weisheng Tan
Beijing University of Posts and Telecommunications
Haocheng Gao
Beijing University of Posts and Telecommunications
Yang Liu
Professor of Beijing University of Posts and Telecommunications
AI Chips and Networks
Jiacheng Liu
Beijing University of Posts and Telecommunications
Zhongjun Yang
Beijing University of Posts and Telecommunications
Jiayu Huang
Beijing University of Posts and Telecommunications
Haihong E
Beijing University of Posts and Telecommunications