🤖 AI Summary
Existing scientific evaluation benchmarks suffer from data leakage risks and low assessment efficiency. To address these issues, this paper introduces EESE, a dynamic scientific evaluation benchmark. Methodologically, EESE constructs a private, continuously updated expert-curated question pool spanning five major disciplines and 500+ subfields; it employs discipline-aware hierarchical sampling and leakage-resistant random subset generation to enable efficient, robust, and contamination-free automated evaluation. Rigorous question quality and evaluation validity are ensured through multi-stage expert collaboration and manual verification. Experimental evaluation across 32 open- and closed-source models demonstrates that EESE effectively discriminates model capabilities along both scientific knowledge and cognitive reasoning dimensions. Moreover, EESE supports longitudinal, scalable, and forward-compatible tracking of scientific reasoning abilities—enabling reliable benchmarking as models evolve.
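The discipline-aware sampling and leakage-resistant subset generation described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the pool schema (`"discipline"` key), the per-release seed, and the top-up step are all assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_eese_subset(pool, subset_size=500, seed=None):
    """Draw a discipline-balanced subset from a private question pool.

    pool: list of dicts, each with a "discipline" key (assumed schema).
    Using a fresh seed per release yields a different published subset
    each time, which captures the leakage-resistance idea: the full
    pool stays private while only small rotating subsets are exposed.
    """
    rng = random.Random(seed)

    # Group instances by discipline (the stratification key).
    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    # Allocate an equal share of the subset to each discipline.
    disciplines = sorted(by_discipline)
    per_discipline = subset_size // len(disciplines)

    subset = []
    for d in disciplines:
        items = by_discipline[d]
        k = min(per_discipline, len(items))
        subset.extend(rng.sample(items, k))

    # Top up from the remaining pool if integer division left a shortfall.
    remaining = [x for x in pool if x not in subset]
    shortfall = subset_size - len(subset)
    if shortfall > 0 and remaining:
        subset.extend(rng.sample(remaining, min(shortfall, len(remaining))))
    return subset
```

With the same seed the draw is reproducible, so a given EESE release can be re-verified; changing the seed at each update cycle rotates the exposed questions.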
📝 Abstract
As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad **Range**, wide **Reach**, and high **Rigor**, yet they often face two major challenges: **data leakage risks** that compromise benchmarking validity, and **evaluation inefficiency** due to large-scale testing. To address these issues, we introduce the **Ever-Evolving Science Exam (EESE)**, a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public **EESE-Pool** with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring **Range**, **Reach**, and **Rigor**; and 2) a periodically updated 500-instance subset, **EESE**, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models across scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.