ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks struggle to evaluate the creative reasoning of large language models when they generate scientific hypotheses from incomplete information. This work proposes a progressive information disclosure framework that begins with only a research topic and question, then incrementally reveals technical details across stages, requiring the model to generate hypotheses at each step. Hypotheses are decomposed into atomic statements and automatically compared against the original paper’s conclusions using semantic similarity metrics. The framework enables quantitative assessment of a model’s reasoning trajectory, from open-ended hypothesis generation under minimal information to grounded reasoning under full experimental details. Experiments on 45 materials science papers show that GPT-5.4 achieves an F1 score of 0.7 even with minimal contextual information and that it outperforms its predecessor.
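The staged disclosure described above can be pictured with a minimal sketch. The `Paper` structure, the `build_stage_prompts` helper, and the prompt wording are hypothetical illustrations, not the authors' implementation. The sketch assumes only that each paper provides a topic, a research question, and an ordered list of technical details.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical container; field names are assumptions for illustration.
    topic: str
    research_question: str
    details: list[str]  # technical details, ordered from least to most revealing


def build_stage_prompts(paper: Paper) -> list[str]:
    """Return one prompt per disclosure stage.

    Stage 0 contains only the topic and research question; each later
    stage reveals one more technical detail than the previous one.
    """
    prompts = []
    for stage in range(len(paper.details) + 1):
        revealed = paper.details[:stage]
        lines = [
            f"Topic: {paper.topic}",
            f"Research question: {paper.research_question}",
        ]
        lines += [f"Detail {i + 1}: {d}" for i, d in enumerate(revealed)]
        lines.append("Propose hypotheses that address the research question.")
        prompts.append("\n".join(lines))
    return prompts
```

A paper with two details yields three prompts, and the hypotheses generated from each prompt would be scored separately against the paper's conclusions.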

📝 Abstract

Scientific discovery is an inherently creative and uncertain process that requires reasoning beyond the recall of known facts. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, the innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, and technical details are progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question; these are compared with the conclusions of the original paper and evaluated via automated semantic similarity of their constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment of capabilities ranging from a model's innovativeness (under minimal information) to its grounded reasoning (under full experimental details), both critical for using LLMs in scientific discovery. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, which is crucial for advancing the development of next-generation AI scientist and co-scientist systems. Specifically, we evaluate GPT-5, GPT-5.4, Gemini 2.5 Pro, and Gemini 3.1 Pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 Pro outperform their previous-generation counterparts, as expected, and that GPT-5.4 in particular maintains an F1 score of 0.7 against ground-truth conclusions even under minimal context.
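To make the claim-level F1 score concrete, here is a minimal sketch of one way such a score could be computed. It is an assumption-laden illustration, not the paper's method: the abstract does not specify the similarity model, the matching rule, or the threshold. Token-overlap (Jaccard) similarity stands in for a real semantic similarity metric, and the `0.5` threshold and the function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def claim_f1(generated, reference, threshold=0.5, sim=jaccard):
    """F1 between generated atomic claims and reference atomic claims.

    A generated claim counts toward precision if it is similar enough to
    any reference claim; a reference claim counts toward recall if any
    generated claim is similar enough to it. The threshold is an assumed
    value chosen for illustration.
    """
    if not generated or not reference:
        return 0.0
    matched_gen = sum(
        any(sim(g, r) >= threshold for r in reference) for g in generated
    )
    matched_ref = sum(
        any(sim(g, r) >= threshold for g in generated) for r in reference
    )
    precision = matched_gen / len(generated)
    recall = matched_ref / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


generated = ["doping increases tensile strength", "the coating is toxic"]
reference = ["doping increases tensile strength", "porosity lowers stiffness"]
score = claim_f1(generated, reference)  # one of two claims matches on each side
```

In this toy case precision and recall are both 0.5, so the F1 score is 0.5. Passing an embedding-based function as `sim` would bring the sketch closer to a semantic comparison.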

Problem

Research questions and friction points this paper is trying to address.

scientific discovery

hypothesis generation

large language models

progressive information disclosure

scientific reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive information disclosure

scientific hypothesis generation

semantic similarity evaluation