OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

📅 2024-06-18
🏛️ Neural Information Processing Systems
📈 Citations: 12
Influential: 1
🤖 AI Summary
Current large language models (LLMs) and large multimodal models (LMMs) lack rigorous evaluation frameworks for complex scientific reasoning and cross-disciplinary discovery. Method: We introduce OlympicArena, the first multidisciplinary, multimodal benchmark designed for assessing cognitive reasoning in superintelligent AI, covering seven scientific domains and 62 international Olympiad competitions, with 11,163 bilingual problems rigorously curated to prevent data leakage. It uniquely adopts Olympiad problems as fine-grained cognitive reasoning probes and pioneers process-level reasoning evaluation alongside cross-modal joint analysis. Contribution/Results: We release an open-source annotation platform, a fine-grained evaluation toolkit, and an automated leaderboard. Experiments reveal that state-of-the-art models, including GPT-4o, achieve only 39.97% overall accuracy, exposing critical bottlenecks in cross-disciplinary deep reasoning and multimodal synergy. This shifts the evaluation paradigm from answer correctness toward interpretability and reasoning transparency.

📝 Abstract
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI's cognitive reasoning across multiple disciplines.
Assessing AI performance in text and multimodal problem-solving.
Identifying limitations in AI's complex reasoning and multimodal integration.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces OlympicArena for AI cognitive reasoning evaluation
Utilizes 11,163 bilingual text and image problems
Provides tools for AI research and performance analysis
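The answer-only criterion mentioned above (overall and per-discipline accuracy over the benchmark's problems) can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation toolkit; the record keys (`discipline`, `prediction`, `answer`) and the exact-match normalization are assumptions for the example.

```python
from collections import defaultdict

def evaluate(records):
    """Answer-level scoring: fraction of problems whose predicted final
    answer exactly matches the reference (case-insensitive), reported
    per discipline and overall."""
    per = defaultdict(lambda: [0, 0])  # discipline -> [correct, total]
    for r in records:
        correct = r["prediction"].strip().lower() == r["answer"].strip().lower()
        per[r["discipline"]][0] += int(correct)
        per[r["discipline"]][1] += 1
    by_discipline = {d: c / t for d, (c, t) in per.items()}
    total_correct = sum(c for c, _ in per.values())
    total_count = sum(t for _, t in per.values())
    return by_discipline, total_correct / total_count

# Toy records standing in for model outputs on benchmark problems.
records = [
    {"discipline": "Math", "prediction": "42", "answer": "42"},
    {"discipline": "Math", "prediction": "7", "answer": "8"},
    {"discipline": "Physics", "prediction": "9.8", "answer": "9.8"},
]
by_disc, overall = evaluate(records)  # Math: 0.5, overall: 2/3
```

Real Olympiad answers often need richer matching (symbolic equivalence, numeric tolerance), which is why the paper also argues for process-level evaluation rather than final-answer checks alone.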
Zhen Huang
Generative AI Research Lab (GAIR)
Zengzhi Wang
Shanghai Jiao Tong University
Data Engineering, Complex Reasoning, Large Language Models, Natural Language Processing
Shijie Xia
Shanghai Jiao Tong University
Natural Language Processing
Xuefeng Li
Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Haoyang Zou
Undergrad, Fudan University
Natural Language Processing, Machine Learning, Generative AI, Large Language Models
Ruijie Xu
ShanghaiTech University
Machine Learning, Computer Vision, RLHF
Run-Ze Fan
University of Massachusetts Amherst
LLM, Data Engineering, Reasoning
Lyumanshan Ye
Shanghai Jiao Tong University
Human-Computer Interaction
Ethan Chern
Shanghai Jiao Tong University
Machine Learning, Natural Language Processing, Artificial Intelligence
Yixin Ye
Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Yikai Zhang
Fudan University
Natural Language Processing, Autonomous Agent
Yuqing Yang
Generative AI Research Lab (GAIR)
Ting Wu
Generative AI Research Lab (GAIR)
Binjie Wang
Generative AI Research Lab (GAIR)
Shichao Sun
Generative AI Research Lab (GAIR)
Yang Xiao
Generative AI Research Lab (GAIR)
Yiyuan Li
University of North Carolina at Chapel Hill
Natural Language Processing, Computational Linguistics
Fan Zhou
Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Steffi Chern
University of Pennsylvania
Natural Language Processing, Artificial Intelligence
Yiwei Qin
Generative AI Research Lab (GAIR)
Yan Ma
Generative AI Research Lab (GAIR)
Jiadi Su
Generative AI Research Lab (GAIR)
Yixiu Liu
Master's student at Shanghai Jiao Tong University
Yuxiang Zheng
Shanghai Jiao Tong University
Shaoting Zhang
Shanghai AI Lab; SenseTime Research
Medical Image Analysis, Computer Vision, Foundation Models
Dahua Lin
The Chinese University of Hong Kong
Computer Vision, Machine Learning, Probabilistic Inference, Bayesian Nonparametrics
Yu Qiao
Shanghai Artificial Intelligence Laboratory
Pengfei Liu
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Generative AI Research Lab (GAIR)