DiMB-RE: Mining the Scientific Literature for Diet-Microbiome Associations

📅 2024-09-29

🏛️ JAMIA Journal of the American Medical Informatics Association

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study addresses the challenge of extracting diet-microbiome associations from the biomedical literature. The authors introduce DiMB-RE, which to their knowledge is the largest and most diverse annotated corpus for this task: 14,450 entities and 4,206 relations across 165 publications (including 30 full-text Results sections), covering 15 entity types and 13 relation types. Using the corpus, they fine-tune and evaluate NLP models for named entity recognition (NER), trigger detection, end-to-end relation extraction, and factuality detection, and they benchmark GPT-4o-mini and GPT-4o on a subset of the data in zero- and one-shot settings. The fine-tuned models reach 0.800 F1 for NER but only a modest 0.445 F1 for end-to-end relation extraction, which is lower than results on similar corpora and reflects the difficulty of the domain. Annotating Results sections improved relation extraction, the impact of trigger detection was mixed, and the generative models were less accurate than the fine-tuned ones. Misclassified entities, missed triggers, and cross-sentence relations were the main sources of relation extraction errors. The corpus and models are publicly released as a benchmark for biomedical literature mining and a resource for personalized nutrition research.


📝 Abstract

OBJECTIVES: To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies.

MATERIALS AND METHODS: We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (eg, Nutrient, Microorganism) and 13 relation types (eg, increases, improves) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked 2 generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings.

RESULTS: DiMB-RE consists of 14,450 entities and 4,206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models.

DISCUSSION: To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. Natural language processing models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors.

CONCLUSION: DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
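The F1 scores reported in the abstract combine precision and recall over predicted versus gold annotations. The sketch below shows one common way to compute F1 for relation extraction, assuming exact matching of (head, relation type, tail) triples. The matching criterion, the function name `precision_recall_f1`, and the example triples are illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import Set, Tuple

# A relation is modeled as (head entity, relation type, tail entity).
# This triple format is an assumption for illustration only.
Relation = Tuple[str, str, str]


def precision_recall_f1(gold: Set[Relation], predicted: Set[Relation]):
    """Return (precision, recall, F1) under exact triple matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted relations (not taken from DiMB-RE).
gold = {
    ("inulin", "increases", "Bifidobacterium"),
    ("dietary fiber", "improves", "gut barrier function"),
    ("resistant starch", "increases", "butyrate"),
}
predicted = {
    ("inulin", "increases", "Bifidobacterium"),  # correct
    ("inulin", "improves", "Bifidobacterium"),   # wrong relation type
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # prints "0.500 0.333 0.400"
```

In an end-to-end setting, a predicted relation only counts as correct if its entities were also extracted correctly, which is one reason relation extraction scores tend to sit well below NER scores.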

Problem

Research questions and friction points this paper is trying to address.

Identify diet-microbiome associations in biomedical literature

Train NLP models for entity, trigger, and relation extraction and for factuality detection

Improve understanding of diet-microbiome roles in health

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned NLP models for entity, trigger, and relation extraction and factuality detection

Constructed DiMB-RE corpus with diverse annotations

Benchmarked GPT-4o-mini and GPT-4o in zero- and one-shot settings
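To make the annotation scheme more concrete, here is a minimal sketch of how typed entities, typed relations, optional trigger words, and cross-sentence relations could be represented. The class names, fields, and example values are hypothetical and do not reflect the actual file format in the DiMB-RE repository; only the entity types Nutrient and Microorganism and the relation type increases come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    text: str
    entity_type: str      # e.g. "Nutrient" or "Microorganism"
    sentence_index: int   # position of the containing sentence in the document


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    relation_type: str             # e.g. "increases"
    trigger: Optional[str] = None  # None when no trigger word is annotated

    def is_cross_sentence(self) -> bool:
        """True when head and tail occur in different sentences."""
        return self.head.sentence_index != self.tail.sentence_index


# Hypothetical example: the two entities appear in different sentences.
nutrient = Entity("inulin", "Nutrient", sentence_index=0)
microbe = Entity("Bifidobacterium", "Microorganism", sentence_index=1)
relation = Relation(nutrient, microbe, "increases", trigger="enriched")

print(relation.is_cross_sentence())  # prints "True"
```

Keeping the sentence index on each entity makes it straightforward to separate within-sentence from cross-sentence relations when analyzing errors.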

🔎 Similar Papers

Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity