EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing geoscience evaluation benchmarks suffer from three critical limitations: insufficient domain specificity, fragmented coverage across subfields, and neglect of open-ended scientific exploration capabilities. To address these gaps, we introduce EarthSE—the first comprehensive large language model (LLM) benchmark for geoscience—spanning five Earth system spheres, 114 disciplinary topics, and 11 task categories. It comprises three hierarchical QA datasets: Earth-Iron (breadth), Earth-Silver (depth), and Earth-Gold (multi-turn open dialogue), constructed from 100,000 peer-reviewed papers with multi-granularity domain annotations. We also propose a novel metric suite that, for the first time, systematically assesses higher-order scientific reasoning, including method induction, limitation analysis, and conceptual proposal. Evaluation of 11 state-of-the-art LLMs reveals substantial deficiencies in geoscientific competence. EarthSE is publicly released on Hugging Face to advance domain-specialized scientific evaluation.

📝 Abstract
Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for fields such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available on https://huggingface.co/ai-earth.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' Earth science exploration capabilities holistically
Assessing open-ended scientific exploration in Earth science domains
Addressing gaps in specialized Earth science benchmark datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs the Earth-Iron and Earth-Silver QA datasets from a corpus of 100,000 research papers
Introduces Earth-Gold, an open-ended multi-turn dialogue dataset with new metrics for scientific exploration
Evaluates 11 leading LLMs across five Earth spheres, 114 disciplines, and 11 task categories