Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To extract fine-grained bioinformatics workflow information from scientific articles when annotated data is scarce, this paper frames the problem as a low-resource extraction task and tests four strategies: (1) building BioToFlow, a manually annotated corpus of 52 articles covering 16 entity types; (2) few-shot named-entity recognition (NER) with an autoregressive language model; (3) NER with masked language models trained on existing and new corpora; and (4) integrating workflow knowledge into NER models. A SciBERT-based NER model reaches a 70.4 F-measure on BioToFlow, comparable to inter-annotator agreement. Knowledge integration improves performance for specific entities but is less effective across the full information schema. The results indicate that high-performance information extraction for bioinformatics workflows is achievable in a low-resource setting.

📝 Abstract
Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
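The F-measure reported above is the harmonic mean of precision and recall over predicted entities. The abstract does not state the exact matching protocol, so the following is only a minimal sketch assuming exact-match, entity-level scoring over BIO-tagged tokens. The function names and the entity labels `Tool` and `Data` are hypothetical placeholders, not taken from the BioToFlow schema.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one sentence of BIO tags into a set of entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is leniently treated as the start of an entity.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def f_measure(gold: List[List[str]], pred: List[List[str]]):
    """Exact-match entity-level precision, recall and F1 over a corpus."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with its sentence index to keep spans distinct.
        gold_spans |= {(idx,) + s for s in bio_to_spans(g)}
        pred_spans |= {(idx,) + s for s in bio_to_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with hypothetical labels: one of two gold entities is found.
gold = [["B-Tool", "I-Tool", "O", "B-Data"]]
pred = [["B-Tool", "I-Tool", "O", "O"]]
precision, recall, f1 = f_measure(gold, pred)
```

In this toy case the single predicted entity is correct (precision 1.0) but only half of the gold entities are recovered (recall 0.5), giving an F1 of about 0.67. A partial-match or per-entity-type variant would give different numbers, so the sketch should not be read as the paper's evaluation script.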
Problem

Research questions and friction points this paper is trying to address.

Extracting workflow information from bioinformatics articles
Improving accessibility and reusability of workflow data
Addressing limited annotated corpora in low-resource settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tailored annotated corpus for bioinformatics workflows
Few-shot NER with autoregressive language models
NER using masked language models with existing and new corpora
Clémence Sebe
Université Paris-Saclay, LISN, CNRS, Orsay, 91400, France
Sarah Cohen-Boulakia
Université Paris-Saclay, LISN, CNRS, Orsay, 91400, France
Olivier Ferret
Senior research scientist, CEA-List
Natural Language Processing, Computational Linguistics, Information Extraction, Lexical Semantics
Aurélie Névéol
Université Paris-Saclay, LISN, CNRS, Orsay, 91400, France