FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

📅 2025-09-21
🤖 AI Summary
This work addresses evaluation contamination in large reasoning models (LRMs) on automatically verifiable tasks. We release ROME, a low-contamination benchmark for vision language models intended to test reasoning from visual clues. Methodologically, we construct an automatically verifiable question-answering dataset spanning both textual and visual modalities, and run a moderate-scale evaluation designed to limit overlap between test questions and mainstream training corpora, so that it is contamination-free to some extent. Experiments reveal significant bottlenecks in current LRMs' vision-language joint reasoning capabilities, particularly on tasks requiring cross-modal logical inference. The benchmark, evaluation data, and other updates are publicly available on the project website.

📝 Abstract
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
Problem

Research questions and friction points this paper is trying to address.

Evaluating large reasoning models on verifiable questions
Testing reasoning capabilities from visual clues
Providing contamination-free assessment of LRMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed contamination-free evaluation benchmark
Released ROME for vision-language reasoning
Tested large reasoning models on verifiable questions
Authors

Bowen Qin
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Chen Yue
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Fang Yin
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Hui Wang
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
JG Yao
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Jiakang Liu
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Jing-Shu Zheng
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Miguel Hu Chen
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Richeng Xuan
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Shibei Meng
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Shiqi Zhou
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Teng Dai
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Tong-Shuai Ren
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Wei Cui
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Xi Yang
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Xialin Du
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Xiaojing Xu
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Xue Sun
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Xuejing Li
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Yaming Liu
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Yesheng Liu
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Ying Liu
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Yonghua Lin
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Yu Zhao
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University
Yunduo Zhang
BAAI FlagEval Team, State Key Laboratory of Multimedia Information Processing, Peking University