🤖 AI Summary
This study investigates how well large language models (LLMs) assess the scientific feasibility of hypotheses, focusing on how the evidence condition (hypothesis only, or with experiment descriptions, outcomes, or both) affects reliability and robustness. Framing the task as diagnostic reasoning, the authors run controlled experiments across multiple LLMs and two datasets to evaluate how the type of input information influences judgment accuracy. They find that outcome evidence is generally more reliable than experiment descriptions: outcomes tend to improve accuracy beyond what the model's internal knowledge provides, whereas incomplete experimental context can degrade it. The work offers empirical insight into when experimental evidence helps LLM-based feasibility assessment and when it introduces fragility.
📝 Abstract
Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
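The four knowledge conditions and the progressive removal of context can be illustrated with a minimal sketch. The abstract does not specify the prompt wording or the removal scheme, so everything below is an assumption made for illustration: the names `CONDITIONS`, `truncate_context`, and `build_prompt` are hypothetical, the prompt template is invented, and dropping trailing sentences is only one possible way to remove portions of the context.

```python
from typing import List

# Hypothetical labels for the four knowledge conditions named in the abstract.
CONDITIONS = ("hypothesis_only", "with_experiments", "with_outcomes", "both")


def truncate_context(sentences: List[str], drop_fraction: float) -> List[str]:
    """Drop the trailing fraction of sentences (one assumed removal scheme)."""
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")
    keep = len(sentences) - int(len(sentences) * drop_fraction)
    return sentences[:keep]


def build_prompt(
    hypothesis: str,
    experiments: List[str],
    outcomes: List[str],
    condition: str,
    drop_fraction: float = 0.0,
) -> str:
    """Assemble a feasibility prompt for one condition and removal level."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")

    parts = [f"Hypothesis: {hypothesis}"]

    # Experiment descriptions appear only in the two conditions that include them.
    if condition in ("with_experiments", "both"):
        kept = truncate_context(experiments, drop_fraction)
        if kept:
            parts.append("Experiments: " + " ".join(kept))

    # Outcome evidence appears only in the two conditions that include it.
    if condition in ("with_outcomes", "both"):
        kept = truncate_context(outcomes, drop_fraction)
        if kept:
            parts.append("Outcomes: " + " ".join(kept))

    # Illustrative instruction; the wording is not taken from the paper.
    parts.append("Answer 'feasible' or 'infeasible', then justify your decision.")
    return "\n".join(parts)
```

Sweeping `condition` over `CONDITIONS` and `drop_fraction` over a grid such as 0.0, 0.25, 0.5, and 0.75 would yield one prompt per cell, whose model predictions could then be scored against the gold feasible/infeasible labels.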