🤖 AI Summary
Existing scientific evaluation benchmarks suffer from data leakage risks and low assessment efficiency. To address these issues, this paper introduces EESE, a dynamic scientific evaluation benchmark. Methodologically, EESE constructs a private, continuously updated expert-curated question pool spanning five major disciplines and 500+ subfields; it employs discipline-aware hierarchical sampling and leakage-resistant random subset generation to enable efficient, robust, and contamination-free automated evaluation. Rigorous question quality and evaluation validity are ensured through multi-stage expert collaboration and manual verification. Experimental evaluation across 32 open- and closed-source models demonstrates that EESE effectively discriminates model capabilities along both scientific knowledge and cognitive reasoning dimensions. Moreover, EESE supports longitudinal, scalable, and forward-compatible tracking of scientific reasoning abilities—enabling reliable benchmarking as models evolve.
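The discipline-aware sampling and leakage-resistant subset generation described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the pool schema (`"discipline"` key), the per-release seed, and the top-up step are all assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_eese_subset(pool, subset_size=500, seed=None):
    """Draw a discipline-balanced subset from a private question pool.

    pool: list of dicts, each with a "discipline" key (assumed schema).
    Using a fresh seed per release yields a different published subset
    each time, which captures the leakage-resistance idea: the full
    pool stays private while only small rotating subsets are exposed.
    """
    rng = random.Random(seed)

    # Group instances by discipline (the stratification key).
    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    # Allocate an equal share of the subset to each discipline.
    disciplines = sorted(by_discipline)
    per_discipline = subset_size // len(disciplines)

    subset = []
    for d in disciplines:
        items = by_discipline[d]
        k = min(per_discipline, len(items))
        subset.extend(rng.sample(items, k))

    # Top up from the remaining pool if integer division left a shortfall.
    remaining = [x for x in pool if x not in subset]
    shortfall = subset_size - len(subset)
    if shortfall > 0 and remaining:
        subset.extend(rng.sample(remaining, min(shortfall, len(remaining))))
    return subset
```

With the same seed the draw is reproducible, so a given EESE release can be re-verified; changing the seed at each update cycle rotates the exposed questions.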
📝 Abstract
As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad **Range**, wide **Reach**, and high **Rigor**, yet they often face two major challenges: **data leakage risks** that compromise benchmarking validity, and **evaluation inefficiency** due to large-scale testing. To address these issues, we introduce the **Ever-Evolving Science Exam (EESE)**, a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public **EESE-Pool** with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring **Range**, **Reach**, and **Rigor**; and 2) a periodically updated 500-instance subset, **EESE**, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models across scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.