PreScience: A Benchmark for Forecasting Scientific Contributions

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes PreScience, a multitask benchmark for forecasting scientific progress, aimed at using AI to predict future research directions, collaborators, and impact from the historical scholarly record. The benchmark comprises a temporally aligned, author-disambiguated dataset of 98K AI papers embedded in a citation and publication graph spanning 502K papers in total, and supports four tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. The study introduces LACERScore, a novel large language model-based automatic evaluation metric that shows high agreement with human annotations. Experimental results reveal that even state-of-the-art models such as GPT-5 achieve only moderate performance on contribution generation (averaging 5.6 on a 1-10 scale) and produce outputs systematically less diverse and less novel than contemporaneous human-authored contributions.

📝 Abstract
Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task -- e.g., in contribution generation, frontier LLMs achieve only moderate similarity to the ground truth (GPT-5 averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.
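The abstract does not spell out how LACERScore is computed, but LLM-based similarity metrics of this kind typically prompt a judge model to compare a candidate against a reference and emit a numeric rating. Below is a minimal, hypothetical sketch of such an LLM-as-judge scorer on the paper's 1-10 scale; the prompt wording, the `call_llm` callable, and all names are illustrative assumptions, not the authors' implementation:

```python
import re
from typing import Callable

# Hypothetical judge prompt; the actual LACERScore prompt is not given here.
JUDGE_TEMPLATE = (
    "Rate how similar the candidate research contribution is to the "
    "reference contribution on a 1-10 scale. Reply with 'Score: <n>'.\n\n"
    "Reference: {reference}\n\nCandidate: {candidate}\n"
)


def parse_score(judge_output: str) -> int:
    """Extract an integer 1-10 score from the judge model's reply."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if not match:
        raise ValueError(f"No score found in: {judge_output!r}")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError(f"Score out of range: {score}")
    return score


def judge_similarity(reference: str, candidate: str,
                     call_llm: Callable[[str], str]) -> int:
    """Score a generated contribution against the ground truth via an LLM judge.

    `call_llm` is a placeholder for any text-in, text-out model call.
    """
    prompt = JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)
    return parse_score(call_llm(prompt))
```

In practice such metrics are often averaged over multiple judge calls to reduce variance, and validated against human annotations, as the paper reports doing for LACERScore.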
Problem

Research questions and friction points this paper is trying to address.

scientific forecasting
AI prediction
research contribution
impact prediction
collaborator prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific forecasting
contribution generation
LACERScore
research benchmark
LLM evaluation