Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how well large language models (LLMs) assess the scientific feasibility of hypotheses, focusing on how the available evidence (hypothesis only, experiment descriptions, outcomes, or both) affects reliability and robustness. Framing the task as diagnostic reasoning, in which a model labels a hypothesis feasible or infeasible and justifies the decision, the authors run controlled experiments across multiple LLMs and two datasets, and probe robustness by progressively removing parts of the experimental and outcome context. They find that outcome evidence is generally more reliable than experiment descriptions: outcomes tend to improve accuracy beyond what the model's internal knowledge provides, whereas experiment descriptions can be brittle and may degrade accuracy when the context is incomplete. The work clarifies when experimental evidence helps LLM-based feasibility assessment and when it introduces fragility.

📝 Abstract

Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
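
The evaluation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the condition names, the `build_prompt` and `truncate_context` helpers, the prompt wording, and the sentence-level random removal are all assumptions made for the example, since the abstract does not specify how context is removed or how prompts are phrased.

```python
import random

# Hypothetical labels for the four knowledge conditions named in the abstract.
CONDITIONS = ("hypothesis_only", "with_experiments", "with_outcomes", "both")


def truncate_context(sentences, drop_fraction, seed=0):
    """Drop a fraction of sentences at random, keeping survivors in order.

    Sentence-level removal is an assumption; the abstract only says that
    portions of the context are progressively removed.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    n_drop = int(len(sentences) * drop_fraction)
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(sentences)), n_drop))
    return [s for i, s in enumerate(sentences) if i not in dropped]


def build_prompt(hypothesis, experiments, outcomes, condition,
                 drop_fraction=0.0, seed=0):
    """Assemble a prompt for one knowledge condition (hypothetical wording)."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    parts = [f"Hypothesis: {hypothesis}"]
    if condition in ("with_experiments", "both"):
        kept = truncate_context(experiments, drop_fraction, seed)
        if kept:
            parts.append("Experiments: " + " ".join(kept))
    if condition in ("with_outcomes", "both"):
        kept = truncate_context(outcomes, drop_fraction, seed)
        if kept:
            parts.append("Outcomes: " + " ".join(kept))
    parts.append("Answer 'feasible' or 'infeasible' and justify your decision.")
    return "\n".join(parts)
```

Calling `build_prompt` once per condition and per `drop_fraction` value yields the grid of inputs over which accuracy would be compared; the model call and scoring are left out because the abstract gives no details about them.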

Problem

Research questions and friction points this paper is trying to address.

scientific feasibility

large language models

experimental evidence

outcome evidence

diagnostic reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific feasibility

large language models

diagnostic reasoning