ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

📅 2025-03-27
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing work lacks a systematic evaluation of large language models' (LLMs') ability to generate high-quality, scientifically grounded research hypotheses. Method: We introduce ResearchBench, the first large-scale benchmark for scientific hypothesis discovery, comprising three sub-tasks: inspiration retrieval, hypothesis composition, and hypothesis ranking. It is constructed from papers published in 2024 across 12 disciplines, with key scientific elements (research questions, background surveys, inspirations, and hypotheses) automatically extracted and expert-validated. The benchmark follows an inspiration-based task decomposition and controls for data contamination by restricting the corpus to 2024 publications, ensuring minimal overlap with LLM pretraining data. Results: State-of-the-art LLMs perform well at inspiration retrieval, an out-of-distribution task, suggesting they can surface novel knowledge associations; this positions LLMs as "research hypothesis mines" capable of generating innovative hypotheses at scale with minimal human intervention.
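The decomposition lends itself to a simple programmatic framing. Below is a minimal Python sketch of how one benchmark instance and the three chained sub-tasks might look; the `BenchmarkInstance` fields and the `model.retrieve`/`compose`/`rank` interface are assumptions for illustration, not the paper's actual code.

```python
# Hypothetical sketch of the three-sub-task decomposition; the dataclass,
# field names, and model interface are illustrative, not the paper's schema.
from dataclasses import dataclass, field


@dataclass
class BenchmarkInstance:
    """One 2024 paper decomposed into the elements the benchmark uses."""
    research_question: str                                  # problem the paper addresses
    background_survey: str                                  # prior-work context
    inspirations: list[str] = field(default_factory=list)   # gold inspiration papers
    hypothesis: str = ""                                    # ground-truth hypothesis


def run_subtasks(model, inst: BenchmarkInstance, corpus: list[str]) -> dict:
    """Chain the three sub-tasks for one instance (assumed model interface)."""
    # 1. Inspiration retrieval: surface candidate papers for the question.
    retrieved = model.retrieve(inst.research_question, corpus, top_k=10)
    # 2. Hypothesis composition: combine question, background, and inspirations.
    candidate = model.compose(inst.research_question, inst.background_survey, retrieved)
    # 3. Hypothesis ranking: score the candidate against the ground truth.
    score = model.rank(candidate, reference=inst.hypothesis)
    return {"retrieved": retrieved, "candidate": candidate, "score": score}
```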

📝 Abstract
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate high-quality research hypotheses
Creating a benchmark for scientific discovery sub-tasks
Assessing LLMs' performance in inspiration retrieval and hypothesis generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs via inspiration-based task decomposition
Automated framework extracts key components from papers across 12 disciplines
Focus on papers published in 2024 to prevent data contamination (see the sketch below)
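
As a concrete illustration of the contamination control above, a publication-date cutoff filter might look like the following; the `published` field and the cutoff value are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative contamination filter: keep only papers published after a model's
# pretraining cutoff. Field names are assumptions, not the paper's pipeline.
from datetime import date


def contamination_safe(papers: list[dict],
                       cutoff: date = date(2024, 1, 1)) -> list[dict]:
    """Return papers published on/after the cutoff, minimizing pretraining overlap."""
    return [p for p in papers if date.fromisoformat(p["published"]) >= cutoff]


corpus = [
    {"title": "Paper A", "published": "2023-11-02"},
    {"title": "Paper B", "published": "2024-03-15"},
]
print(contamination_safe(corpus))  # keeps only "Paper B"
```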