AI Summary
To address the limited hypothesis validation efficiency caused by high cost and low throughput of wet-lab experiments, this paper proposes an experiment-guided hypothesis ranking paradigm. We formally define this task for the first time; develop an interpretable *in silico* hypothesis simulator that integrates domain knowledge with noise-aware modeling; and design a dynamic ranking method leveraging functional clustering and simulation-based feedback. Evaluated on a real-world chemical hypothesis dataset comprising 124 hypotheses, our approach significantly outperforms baselines relying solely on internal reasoning of large language models. Ablation studies confirm the individual contributions of each component. Results demonstrate that the simulation feedback mechanism effectively bridges the gap between theoretical inference and empirical constraints, achieving both strong generalizability and interpretability. This work establishes a novel pathway for data–experiment co-driven scientific discovery.
Abstract
Hypothesis ranking is a crucial component of automated scientific discovery, particularly in the natural sciences, where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language models' internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging because repeatedly conducting real experiments is impractical in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, which models hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
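The two ideas in the abstract, a similarity-plus-noise simulator and a feedback-driven ranking over functional clusters, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: hypotheses are represented as sets of functional tags, similarity is Jaccard overlap, noise is Gaussian, and the selection rule is a simple greedy choice by mean cluster feedback. The function names `simulate_experiment` and `experiment_guided_order` are hypothetical, and the paper's three domain-informed assumptions are not modeled.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of functional tags (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def simulate_experiment(hypothesis, ground_truth, noise_std=0.05, rng=random):
    """Simulated outcome: similarity to the ground-truth hypothesis plus Gaussian noise.

    The score is clipped to [0, 1]. The Gaussian noise model is an assumption.
    """
    score = jaccard(hypothesis, ground_truth) + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))


def experiment_guided_order(hypotheses, clusters, run_experiment):
    """Return the order in which hypotheses are tested.

    hypotheses: dict mapping hypothesis id -> tag set
    clusters: dict mapping cluster id -> list of hypothesis ids
    run_experiment: callable taking a tag set and returning a score

    Illustrative greedy rule: every cluster is tried once, then the next
    hypothesis comes from the cluster with the best mean feedback so far.
    """
    remaining = {c: list(ids) for c, ids in clusters.items()}
    feedback = {c: [] for c in clusters}
    order = []

    def priority(c):
        # Untested clusters get optimistic priority so each is sampled at least once.
        if not feedback[c]:
            return float("inf")
        return sum(feedback[c]) / len(feedback[c])

    while any(remaining.values()):
        candidates = [c for c in remaining if remaining[c]]
        best = max(candidates, key=priority)
        h = remaining[best].pop(0)
        feedback[best].append(run_experiment(hypotheses[h]))
        order.append(h)
    return order
```

To use the sketch, pass `experiment_guided_order` a closure such as `lambda h: simulate_experiment(h, ground_truth)`. Each simulated result then shifts which cluster is sampled next, which is the difference from a pre-experiment ranking that fixes the order before any feedback is seen.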