Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of astronomy-focused large language models (LLMs) predominantly assess basic question-answering and knowledge retrieval, neglecting systematic assessment of core scientific competencies such as deep conceptual understanding, multi-step deductive reasoning, and multimodal astronomical analysis. This work introduces the first benchmark grounded in authentic theory and data-analysis problems from the International Olympiad on Astronomy and Astrophysics (IOAA), rigorously evaluating five state-of-the-art models, including Gemini 2.5 Pro and GPT-5. Gemini 2.5 Pro and GPT-5 achieve 85.6% and 84.2% average scores on the theory exams, reaching gold-medal IOAA proficiency, yet all models exhibit consistent weaknesses in conceptual reasoning, geometric reasoning, and spatial visualization. On the data-analysis exams, GPT-5 attains an 88.5% average score, ranking in the top 10 among human participants. The work establishes a reproducible, competition-derived evaluation framework for complex astronomical reasoning, providing a capability taxonomy and benchmark to guide domain-specific LLM development.

📝 Abstract
While task-specific demonstrations show early success in applying large language models (LLMs) to automate some astronomical research tasks, they provide only an incomplete view of the capabilities needed to solve astronomy problems, calling for a more thorough understanding of LLMs' strengths and limitations. Existing benchmarks and evaluations focus on simple question-answering that primarily tests astronomical knowledge and fail to evaluate the complex reasoning required for real-world research in the discipline. Here, we address this gap by systematically benchmarking five state-of-the-art LLMs on the International Olympiad on Astronomy and Astrophysics (IOAA) exams, which are designed to examine deep conceptual understanding, multi-step derivations, and multimodal analysis. With average scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 (the two top-performing models) not only achieve gold-medal-level performance but also rank in the top two among the ~200-300 participants in each of the four IOAA theory exams evaluated (2022-2025). Results on the data-analysis exams show more divergence: GPT-5 still excels with an 88.5% average score, ranking in the top 10 among participants in the four most recent IOAAs, while the other models' performance drops to 48-76%. Furthermore, our in-depth error analysis identifies conceptual reasoning, geometric reasoning, and spatial visualization (52-79% accuracy) as consistent weaknesses across all LLMs. Hence, although LLMs approach peak human performance on the theory exams, critical gaps must be addressed before they can serve as autonomous research agents in astronomy.
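The headline figures in the abstract are averages over the four exam years (2022-2025). A minimal sketch of that aggregation step, where the per-exam scores are illustrative placeholders rather than the paper's actual per-year results:

```python
# Illustrative only: averaging per-exam percentage scores into a single
# reported figure. The per-exam numbers below are placeholders, not the
# paper's actual per-year results.

def average_score(scores: dict[str, float]) -> float:
    """Mean percentage score across the given exams."""
    return sum(scores.values()) / len(scores)

# Hypothetical theory-exam scores (percent) for one model.
theory_scores = {
    "IOAA 2022": 84.0,
    "IOAA 2023": 86.0,
    "IOAA 2024": 85.0,
    "IOAA 2025": 87.0,
}

print(f"Average theory score: {average_score(theory_scores):.1f}%")
```

Ranking against human participants then simply compares this average with the score distribution of the ~200-300 contestants in each year.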
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' complex reasoning capabilities in astronomy
Benchmarking models on Olympiad exams requiring multi-step derivations
Identifying conceptual and spatial reasoning as key limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs on International Astronomy Olympiad exams
Evaluating conceptual reasoning and multi-step derivations
Identifying geometric reasoning as key weakness in models
Lucas Carrit Delgado Pinheiro
Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA.
Ziru Chen
The Ohio State University
Conversational AI · Natural Language Processing · Machine Learning
Bruno Caixeta Piazza
Escola Politécnica, Universidade de São Paulo, São Paulo, SP 05508-010, Brazil.
Ness Shroff
Department of ECE and CSE, The Ohio State University
Machine Learning · Wireless Networks · Performance Evaluation · Cloud Computing · Mobile Networks
Yingbin Liang
Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA.
Yuan-Sen Ting
Department of Astronomy, The Ohio State University, Columbus, OH 43210, USA.
Huan Sun
Endowed CoE Innovation Scholar and Associate Professor, The Ohio State University
Agents · Large Language Models · Natural Language Processing · AI