Make Literature-Based Discovery Great Again through Reproducible Pipelines

📅 2025-02-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address longstanding challenges in literature-based discovery (LBD), including irreproducible experiments, fragmented evaluation protocols, and inconsistent use of benchmark datasets, this paper presents an end-to-end, reproducible pipeline for bisociative LBD. Alongside a selection of traditional LBD approaches, it contributes three of the authors' own: (1) an ensemble-based approach, (2) an outlier-based approach, and (3) a link prediction-based approach. The pipeline is delivered as Jupyter notebooks covering data acquisition, text preprocessing, hypothesis formulation, and evaluation, together with openly accessible benchmark datasets and a ready-to-run Docker recipe. These resources let readers replicate the selected LBD methods and compare new approaches under the same conditions, supporting reproducibility and collaboration in LBD research.

📝 Abstract

By connecting disparate sources of scientific literature, literature-based discovery (LBD) methods help to uncover new knowledge and generate new research hypotheses that cannot be found from domain-specific documents alone. Our work focuses on bisociative LBD methods that combine bisociative reasoning with LBD techniques. The paper presents LBD through the lens of reproducible science to ensure the reproducibility of LBD experiments, overcome the inconsistent use of benchmark datasets and methods, trigger collaboration, and advance the LBD field toward more robust and impactful scientific discoveries. The main novelty of this study is a collection of Jupyter Notebooks that illustrate the steps of the bisociative LBD process, including data acquisition, text preprocessing, hypothesis formulation, and evaluation. The contributed notebooks implement a selection of traditional LBD approaches, as well as our own ensemble-based, outlier-based, and link prediction-based approaches. The reader can benefit from hands-on experience with LBD through open access to benchmark datasets, code reuse, and a ready-to-run Docker recipe that ensures reproducibility of the selected LBD methods.
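
The four steps named in the abstract (data acquisition, text preprocessing, hypothesis formulation, evaluation) can be illustrated with a minimal Python sketch. It is not taken from the paper's notebooks. The toy sentences, the stopword list, the scoring rule (the smaller of a term's two document frequencies), and the function names `preprocess`, `doc_frequency`, `bridging_candidates`, and `precision_at_k` are all assumptions made for illustration. The idea shown is only the general one behind bisociative LBD: terms that occur in two otherwise separate literatures are candidate bridges between them.

```python
import re
from collections import Counter

# Step 1, data acquisition: toy stand-ins for two domain literatures.
# These sentences are illustrative assumptions, not the paper's benchmark data.
DOMAIN_A = [
    "Migraine attacks are associated with spreading cortical depression.",
    "Serotonin levels change during migraine episodes.",
]
DOMAIN_C = [
    "Magnesium deficiency can trigger spreading cortical depression.",
    "Magnesium affects serotonin receptor activity.",
]

# A tiny hypothetical stopword list, just large enough for the toy data.
STOPWORDS = {"are", "with", "can", "during", "the", "a", "of", "and"}


def preprocess(doc):
    """Step 2: lowercase, tokenize on letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]


def doc_frequency(docs):
    """Count in how many documents of a collection each term appears."""
    df = Counter()
    for doc in docs:
        df.update(set(preprocess(doc)))
    return df


def bridging_candidates(docs_a, docs_c):
    """Step 3: rank terms shared by both domains.

    The score is the smaller of the two document frequencies, an
    assumed rule chosen for simplicity. Ties are broken alphabetically.
    """
    df_a, df_c = doc_frequency(docs_a), doc_frequency(docs_c)
    shared = set(df_a) & set(df_c)
    scored = [(term, min(df_a[term], df_c[term])) for term in shared]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def precision_at_k(ranked, gold, k):
    """Step 4: fraction of the top-k ranked terms found in a gold-standard set."""
    top = [term for term, _ in ranked[:k]]
    return sum(term in gold for term in top) / k


ranked = bridging_candidates(DOMAIN_A, DOMAIN_C)
for term, score in ranked:
    print(term, score)
```

Terms unique to one domain, such as "migraine" or "magnesium", never appear in the ranking, because a bridging term must by definition occur on both sides. A real evaluation would compare the ranking against an expert-curated list of bridging terms; the `gold` argument here is a placeholder for such a list.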

Problem

Research questions and friction points this paper is trying to address.

Enhance reproducibility in literature-based discovery

Standardize use of benchmark datasets and methods

Facilitate collaboration and robust scientific discoveries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Jupyter Notebooks for LBD

Ensemble-based LBD approaches

Docker recipe for reproducibility
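
The page names ensemble-based approaches without describing them, so the following Python sketch shows only one common way such an ensemble could be built, under stated assumptions: several simple heuristics each vote for the candidate terms in their own top fraction, and terms are ranked by vote count. The function `ensemble_rank`, the `top_fraction` parameter, the three heuristics, and the frequency tables are hypothetical and are not the authors' implementation.

```python
# Hypothetical document frequencies of candidate terms in two domains.
FREQ_A = {"calcium": 5, "magnesium": 2, "serotonin": 9,
          "stress": 7, "vascular": 4, "platelet": 1}
FREQ_C = {"calcium": 4, "magnesium": 8, "serotonin": 6,
          "stress": 1, "vascular": 2, "platelet": 3}


def ensemble_rank(terms, heuristics, top_fraction=1 / 3):
    """Rank terms by how many heuristics place them in their top fraction.

    Each heuristic is a function mapping a term to a score (higher is
    better). Ties are broken alphabetically so the result is deterministic.
    """
    votes = {term: 0 for term in terms}
    cutoff = max(1, round(len(terms) * top_fraction))
    for score in heuristics:
        ranked = sorted(terms, key=lambda t: (-score(t), t))
        for term in ranked[:cutoff]:
            votes[term] += 1
    ranking = sorted(terms, key=lambda t: (-votes[t], t))
    return ranking, votes


# Three assumed heuristics: frequency in each domain, and how balanced
# the term's frequency is across the two domains.
heuristics = [
    lambda t: FREQ_A[t],
    lambda t: FREQ_C[t],
    lambda t: min(FREQ_A[t], FREQ_C[t]) / max(FREQ_A[t], FREQ_C[t]),
]

ranking, votes = ensemble_rank(sorted(FREQ_A), heuristics)
for term in ranking:
    print(term, votes[term])
```

Voting on rank position rather than averaging raw scores keeps heuristics with different scales comparable, which is one reason this design is often chosen for combining weak rankers.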

🔎 Similar Papers

An Autonomous Large Language Model Agent for Chemical Literature Data Mining