Evaluation of OpenAI o1: Opportunities and Challenges of AGI

📅 2024-09-27
🏛️ arXiv.org
📈 Citations: 99
Influential: 3
🤖 AI Summary
This study systematically evaluates the OpenAI o1-preview model’s capabilities on complex, cross-disciplinary reasoning tasks—spanning computer science, mathematics, medicine, and linguistics—to assess critical progress toward and bottlenecks hindering artificial general intelligence (AGI). Method: We adopt a multidisciplinary, multi-granularity evaluation framework integrating standardized benchmarks, domain-expert blind evaluation, human-in-the-loop assessment, and interpretability analysis across high-stakes scenarios: programming competitions, mathematical theorem proving, radiology report generation, chip EDA script synthesis, and financial modeling. Contribution/Results: We empirically identify a “strong reasoning emergence” phenomenon: the model achieves 83.3% accuracy on competitive programming tasks, 100% on high-school-level mathematical reasoning, and surpasses state-of-the-art models in radiological diagnosis and chip design. Its cross-domain performance approaches or matches that of human experts—providing pivotal empirical evidence and a novel evaluation paradigm for large language models advancing toward AGI.

📝 Abstract
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:
- 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.
- 100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains like medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing: o1-preview has comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.
The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
Problem

Research questions and friction points this paper is trying to address.

Evaluates OpenAI o1's performance in diverse complex reasoning tasks
Assesses human-level or superior capabilities across multiple domains
Identifies limitations and progress toward artificial general intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates OpenAI o1-preview model performance
Achieves human-level performance on complex reasoning tasks
Demonstrates AGI progress across domains
Tianyang Zhong
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada
Zheng Liu
School of Computing, University of Georgia, GA, USA
Yi Pan
School of Computing, University of Georgia, GA, USA
Yutong Zhang
Institute of Medical Research, Northwestern Polytechnical University, Xi’an, China
Yifan Zhou
College of Arts and Sciences, University of Georgia, Athens, USA
Shizhe Liang
Institute of Plant Breeding, Genetics & Genomics, University of Georgia, Athens, GA, USA
Zihao Wu
University of Georgia
Brain-inspired AI, Artificial General Intelligence, NLP, Medical Image Analysis
Yanjun Lyu
PhD Student of Computer Science, University of Texas at Arlington
Peng Shu
Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
Xiao-Xing Yu
Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
Chao-Yang Cao
Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
Hanqi Jiang
University of Georgia
Medical Image Analysis, Multi-modal Large Language Models
Hanxu Chen
The Lamar Dodd School of Art, University of Georgia, GA, USA
Yiwei Li
School of Computing, University of Georgia, GA, USA
Junhao Chen
School of Computing, University of Georgia, GA, USA
Huawen Hu
Northwestern Polytechnical University
Reinforcement Learning, Robotics, Brain Computer Interface
Yihe Liu
Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
Huaqin Zhao
School of Computing, University of Georgia, GA, USA
Shaochen Xu
School of Computing, University of Georgia, GA, USA
Haixing Dai
School of Computing, University of Georgia, GA, USA
Lin Zhao
School of Computing, University of Georgia, GA, USA
Ruidong Zhang
Cornell University
Ubiquitous computing, Wearable computing
Wei Zhao
Department of Radiology, The Second Xiangya Hospital, Central South Uni