AI Summary
Existing scientific capability benchmarks suffer from data contamination and insufficient disciplinary coverage, leading to systematic overestimation of large language models' (LLMs) numerical reasoning abilities.
Method: We introduce the first dynamic evaluation benchmark for scientific numerical reasoning, comprising over 1,000 Olympiad-level, multi-disciplinary computational problems. It features a novel dynamic numerical initialization mechanism that randomly samples problem parameters for each inference round, eliminating training-data leakage and overfitting. The benchmark integrates parametric problem generation, cross-disciplinary scientific modeling, and a robustness-aware evaluation framework to enable contamination-free, reproducible assessment.
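The dynamic numerical initialization described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the projectile problem template, parameter ranges, and relative-tolerance check are all hypothetical assumptions chosen for the sketch.

```python
import random

def sample_projectile_problem(rng: random.Random):
    """Generate one parameterized problem instance.

    Illustrative template (assumed, not from SciDA): each call draws fresh
    numerical parameters, so a model cannot rely on memorized answer values
    tied to a fixed problem statement.
    """
    v0 = round(rng.uniform(5.0, 50.0), 1)  # initial speed in m/s, randomized per round
    g = 9.8                                # gravitational acceleration, m/s^2
    question = (
        f"A ball is thrown straight up at {v0} m/s. "
        f"Taking g = {g} m/s^2, what maximum height does it reach, in meters?"
    )
    answer = v0 ** 2 / (2 * g)             # closed-form key: h = v0^2 / (2g)
    return question, answer

def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Robustness-aware grading: accept answers within a relative tolerance."""
    return abs(predicted - reference) <= rel_tol * abs(reference)

# A fixed seed makes a given evaluation round reproducible,
# while different seeds yield different numerical initializations.
rng = random.Random(42)
q, a = sample_projectile_problem(rng)
```

Re-running the sampler with the same seed reproduces an evaluation round exactly, while a fresh seed per round yields new numbers and defeats answer-pattern memorization.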
Contribution/Results: Experiments across leading closed- and open-weight LLMs reveal an average performance drop of 23.7%, exposing severe pattern dependency in current LLM numerical reasoning. This benchmark establishes the first decontaminated, realistic metric for scientific reasoning capability.
Abstract
Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems more effectively. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing ones either face the risk of data contamination or lack disciplinary coverage. Specifically, because the data sources used for LLM training overlap with static benchmarks, answer keys or numerical patterns may be inadvertently memorized (i.e., data contamination), leading to systematic overestimation of reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark consisting exclusively of over 1k Olympiad-level numerical computation problems, with randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with both closed-source and open-source top-performing LLMs and observe that their performance drops significantly under random numerical initialization. Thus, we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The data is available at https://huggingface.co/datasets/m-a-p/SciDA