🤖 AI Summary
This work addresses the disconnect between tool references in scientific narratives and their implementations in executable bioinformatics workflows, which hinders reproducibility and reuse. To bridge this gap, the authors propose CoPaLink, an end-to-end framework that identifies tool entities in both scientific text and Nextflow code and links them across the two modalities using established bioinformatics knowledge bases such as Bioconda and Bioweb. Built on corpora of articles and workflow code with curated tool annotations, the approach achieves individual F1 scores of 84–89% for its three components (tool recognition in text, tool recognition in code, and entity linking) and a joint accuracy of 66%. By aligning descriptive content with computational implementation, CoPaLink narrows the gap between narrative method descriptions and executable code, thereby supporting workflow reproducibility.
📝 Abstract
Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. Clearly connecting the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. This task requires linking the bioinformatics tools used in workflow code with their mentions in the published workflow description.
Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for tool mentions in workflow code, and entity linking grounded in bioinformatics knowledge bases. We propose approaches for all three steps, each achieving a high individual F1-measure (84–89%), with a joint accuracy of 66% when evaluated on Nextflow workflows using the Bioconda and Bioweb knowledge bases. CoPaLink leverages corpora of scientific articles and executable workflow code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations.
Availability: The code is available at https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments and https://gitlab.liris.cnrs.fr/sharefair/copalink. The corpora are available at https://doi.org/10.5281/zenodo.18526700, https://doi.org/10.5281/zenodo.18526760 and https://doi.org/10.5281/zenodo.18543814.
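To make the three-step data flow described above concrete (tool mentions in text, tool mentions in code, linking through a knowledge base), the following is a minimal toy sketch. It is not the CoPaLink implementation: the abstract describes NER components, whereas this sketch swaps in plain dictionary and first-token lookups. The `KNOWLEDGE_BASE` entries, the function names, and the example paper sentence and script are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini knowledge base: canonical tool id -> lowercase aliases.
# A stand-in for resources such as Bioconda; entries are illustrative only.
KNOWLEDGE_BASE = {
    "fastqc": {"fastqc"},
    "bwa": {"bwa", "bwa-mem"},
    "samtools": {"samtools"},
}


def link_to_kb(mention):
    """Return the canonical id whose alias set contains the mention, else None."""
    key = mention.strip().lower()
    for tool_id, aliases in KNOWLEDGE_BASE.items():
        if key in aliases:
            return tool_id
    return None


def tools_in_text(text):
    """Alias lookup standing in for a text NER model (assumption, not the paper's method)."""
    lowered = text.lower()
    found = set()
    for tool_id, aliases in KNOWLEDGE_BASE.items():
        for alias in aliases:
            # Require that the alias is not embedded in a longer word or hyphenated name.
            pattern = r"(?<![\w-])" + re.escape(alias) + r"(?![\w-])"
            if re.search(pattern, lowered):
                found.add(tool_id)
                break
    return found


def tools_in_code(script):
    """Treat the first token of each shell command as a candidate tool mention."""
    found = set()
    for line in script.splitlines():
        for command in line.split("|"):
            tokens = command.split()
            if tokens:
                tool_id = link_to_kb(tokens[0])
                if tool_id is not None:
                    found.add(tool_id)
    return found


def align(text, script):
    """Compare the tools grounded from the paper text with those grounded from the code."""
    text_tools = tools_in_text(text)
    code_tools = tools_in_code(script)
    return {
        "linked": sorted(text_tools & code_tools),
        "code_only": sorted(code_tools - text_tools),
        "text_only": sorted(text_tools - code_tools),
    }


# Hypothetical inputs: one sentence from a paper and the shell body of a workflow step.
paper = "Reads were checked with FastQC and aligned using BWA-MEM."
script = """
fastqc reads.fastq
bwa mem ref.fa reads.fastq | samtools sort -o out.bam
"""

result = align(paper, script)
print(result)
```

In this toy example, `samtools` ends up under `code_only` because it is used in the script but never mentioned in the sentence, which is the kind of mismatch between description and implementation that linking is meant to surface.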