Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the disconnect between tool mentions in scientific narratives and their implementations in executable bioinformatics workflows, which hinders reproducibility and reuse. To bridge this gap, the authors propose CoPaLink, an automated approach that identifies tool entities in both scientific text and Nextflow code and links them across the two using bioinformatics knowledge bases such as Bioconda and Bioweb. Built on corpora with curated tool annotations, the approach achieves F1 scores of 84–89 for the individual components and a joint accuracy of 66. By aligning descriptive content with computational implementation, CoPaLink narrows the gap between narrative method descriptions and executable code, thereby supporting workflow reproducibility.
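To make the reported F1 figures concrete, the following minimal sketch computes precision, recall, and F1 for tool-mention recognition. It assumes exact-match scoring over `(start, end, surface form)` spans, which is a common convention but is not confirmed by this page as the paper's evaluation protocol. The example spans and the function name `precision_recall_f1` are hypothetical.

```python
from typing import Set, Tuple

# A mention is modelled as (start offset, end offset, surface form).
# This representation is an assumption made for illustration.
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted mentions against gold mentions using exact span match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: two of three predictions match the gold spans.
gold = {(0, 8, "samtools"), (20, 23, "bwa"), (40, 46, "fastqc")}
predicted = {(0, 8, "samtools"), (20, 23, "bwa"), (50, 57, "multiqc")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a component only scores well when it both finds most tool mentions and avoids spurious ones.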

📝 Abstract
Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. The ability to clearly connect the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. This task requires linking bioinformatics tools in workflow code with their mentions in a published workflow description.

Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for tool mentions in workflow code, and entity linking grounded in bioinformatics knowledge bases. We propose approaches for all three steps, achieving a high individual F1-measure (84–89) and a joint accuracy of 66 when evaluated on Nextflow workflows using the Bioconda and Bioweb knowledge bases. CoPaLink leverages corpora of scientific articles and executable workflow code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations.

Availability: The code is available at https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments and https://gitlab.liris.cnrs.fr/sharefair/copalink. The corpora are available at https://doi.org/10.5281/zenodo.18526700, https://doi.org/10.5281/zenodo.18526760 and https://doi.org/10.5281/zenodo.18543814.
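The linking step of the three-component design can be illustrated with a toy sketch. This is not the CoPaLink implementation: it assumes the two NER components have already produced lists of mentions, replaces the knowledge bases with a small hypothetical alias table (`TOY_KB`), and links by normalized dictionary lookup, a deliberately simple stand-in for whatever linking method the paper uses. All function names here are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical miniature knowledge base: canonical tool id -> known aliases.
# Real entries would come from resources such as Bioconda; these are illustrative.
TOY_KB: Dict[str, Set[str]] = {
    "samtools": {"samtools"},
    "bwa": {"bwa", "bwa-mem", "burrows-wheeler aligner"},
    "fastqc": {"fastqc"},
}


def normalize(mention: str) -> str:
    """Lowercase the mention and collapse whitespace and underscores."""
    return re.sub(r"[\s_]+", " ", mention.strip().lower())


def link_to_kb(mention: str, kb: Dict[str, Set[str]]) -> Optional[str]:
    """Return the canonical id whose aliases contain the normalized mention."""
    key = normalize(mention)
    for canonical, aliases in kb.items():
        if key in aliases:
            return canonical
    return None


def link_paper_to_code(
    text_mentions: List[str],
    code_mentions: List[str],
    kb: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Pair text and code mentions that resolve to the same knowledge-base entry."""
    code_by_id: Dict[str, List[str]] = {}
    for mention in code_mentions:
        kb_id = link_to_kb(mention, kb)
        if kb_id is not None:
            code_by_id.setdefault(kb_id, []).append(mention)

    links = []
    for mention in text_mentions:
        kb_id = link_to_kb(mention, kb)
        # Unresolved text mentions (kb_id is None) never match a code entry.
        for code_mention in code_by_id.get(kb_id, []):
            links.append((mention, code_mention, kb_id))
    return links


# Hypothetical NER outputs for a paper and its workflow code.
text_mentions = ["Burrows-Wheeler Aligner", "FastQC", "MultiQC"]
code_mentions = ["bwa-mem", "SAMTOOLS"]
print(link_paper_to_code(text_mentions, code_mentions, TOY_KB))
```

Routing both sides through a shared knowledge-base identifier means a paper and its code can be linked even when they name the same tool differently, while mentions missing from the knowledge base or present on only one side remain unlinked.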
Problem

Research questions and friction points this paper is trying to address.

workflow reproducibility
bioinformatics tools
entity linking
scientific text
executable code
Innovation

Methods, ideas, or system contributions that make the work stand out.

workflow reproducibility
tool linking
named entity recognition
bioinformatics knowledge base
cross-modal entity alignment
Clémence Sebe
Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400, Orsay, France
Olivier Ferret
Senior research scientist, CEA-List
Natural Language Processing, Computational Linguistics, Information Extraction, Lexical Semantics
Aurélie Névéol
Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400, Orsay, France
Mahdi Esmailoghli
Department of Computer Science, Humboldt-Universität zu Berlin, 10099, Berlin, Germany
Ulf Leser
Department of Computer Science, Humboldt-Universität zu Berlin, 10099, Berlin, Germany
Sarah Cohen-Boulakia
Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400, Orsay, France