🤖 AI Summary
Automated updating of biomedical knowledge graphs is hindered by the absence of directional annotations (i.e., ambiguous subject/object roles) for entity relations in biomedical literature. To address this, the work introduces BioRED+, the first fine-grained, directionally annotated corpus, containing 10,864 subject/object-labeled instances. The authors propose a soft-prompt-based multi-task language model that jointly performs relation classification, scientific novelty assessment, and entity role disambiguation (subject vs. object). Crucially, the framework establishes the first document-level, direction-aware modeling paradigm for BioRED, integrating biomedical named entity recognition with relation extraction. Experiments demonstrate significant performance gains over state-of-the-art large language models, including GPT-4 and Llama-3, on two benchmark tasks, particularly in directional relation extraction. Both the code and the BioRED+ corpus are publicly released.
📝 Abstract
Biological relation networks contain rich information for understanding the biological mechanisms underlying relationships among entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges to updating this network knowledge. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, which is essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models, such as the state-of-the-art GPT-4 and Llama-3, on two benchmarking tasks. Our source code and dataset are available at https://github.com/ncbi-nlp/BioREDirect.