🤖 AI Summary
Existing AI4Science benchmarks focus primarily on citation prediction or literature retrieval, which do not capture the chains of enabling contributions behind a scientific discovery. This work introduces the task of discovery pathway forecasting: given a target contribution and the prior literature available at a specified time, the goal is to identify the enabling contributions required to realize it, and to either ground each one in prior work or label it as an unmapped decision. To this end, we present SciPaths, the first benchmark designed to evaluate pathway reasoning, comprising 262 expert-annotated gold pathways and 2,444 silver pathways. We further propose a framework integrating semantic matching, large language model evaluation, and role-evidence annotation. Experiments show that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies the hardest to recover. However, providing gold enabling contributions substantially improves prior-work grounding, indicating that decomposition quality is a major bottleneck in end-to-end pathway recovery.
📝 Abstract
Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.
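To make the pathway structure and the F1 metric concrete, here is a minimal sketch, not the benchmark's actual implementation. The class `EnablingContribution`, the functions `token_jaccard` and `pathway_f1`, and the 0.5 threshold are all hypothetical names and values chosen for illustration. The abstract does not specify how strict semantic matching is computed, so token-overlap Jaccard similarity stands in for it here, combined with greedy one-to-one matching between predicted and gold contributions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EnablingContribution:
    """One pathway entry (field names are illustrative assumptions)."""
    description: str
    role: str
    rationale: str
    grounding: Optional[str] = None  # None marks an unmapped decision


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pathway_f1(
    predicted: List[EnablingContribution],
    gold: List[EnablingContribution],
    similarity: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> float:
    """F1 over contributions using greedy one-to-one matching."""
    # Score every (predicted, gold) pair, best pairs first.
    candidates = sorted(
        (
            (similarity(p.description, g.description), i, j)
            for i, p in enumerate(predicted)
            for j, g in enumerate(gold)
        ),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs are all below the cutoff
        if i in used_pred or j in used_gold:
            continue  # each item may be matched at most once
        used_pred.add(i)
        used_gold.add(j)

    true_positives = len(used_pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a model that predicts three contributions of which one matches a two-item gold pathway gets precision 1/3 and recall 1/2. Passing a different `similarity` callable, for example an embedding-based or LLM-judged score, changes the matching criterion without altering the F1 computation.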