🤖 AI Summary
This study addresses the end-to-end relation extraction (E2ERE) challenge posed by discontinuous and nested entities in rare disease texts. Using the RareDis information extraction dataset, on which E2ERE had not previously been attempted, it systematically compares three paradigms: pipeline (NER→RE), sequence-to-sequence (Seq2Seq), and generative GPT-based approaches. Experiments show that a well-designed pipeline model achieves the highest F1 score, outperforming GPT models by over 10 points; Seq2Seq models are not far behind the pipeline, while GPT models trail even Seq2Seq despite having eight times as many parameters. Error analysis attributes many NER errors to partial matches and discontinuous entities, and the findings are confirmed on a second E2ERE dataset for chemical-protein interactions (CHEMPROT). The main takeaways are threefold: (1) this is the first E2ERE study on RareDis; (2) when training data is available, smaller models trained and tailored for E2ERE deliver better accuracy at lower cost and carbon footprint, although generative LMs remain better suited to zero-shot settings; and (3) new methods are needed that combine the strengths of smaller pipeline models with those of larger GPT models.
📝 Abstract
End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches to E2ERE: NER $\rightarrow$ RE pipelines, joint sequence-to-sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each of these approaches and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters are worse than even sequence-to-sequence models and lose to pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors, contributing to lower overall E2E performance. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, when training data is available, our results show that it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of both worlds from smaller encoder-decoder pipeline models and the larger GPT models to improve E2ERE. As of now, we see that well-designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. Our contribution is also the first to conduct E2ERE for the RareDis dataset.
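To illustrate why partial matches and discontinuous entities can depress end-to-end scores, the following is a minimal sketch of micro-averaged precision, recall, and F1 over relation triples. It assumes strict matching, where a predicted relation counts only if both entities (type and every span fragment) and the relation label agree exactly with the gold annotation. The abstract does not state the exact matching criteria, so this is an assumption and not the authors' evaluation script; the helper names, entity types, relation label, and character offsets are all hypothetical.

```python
from typing import FrozenSet, Set, Tuple

# A span fragment is a (start, end) character offset pair (hypothetical offsets below).
Span = Tuple[int, int]
# An entity is (type, set of fragments); more than one fragment means it is discontinuous.
Entity = Tuple[str, FrozenSet[Span]]
# A relation is (head entity, relation label, tail entity).
Relation = Tuple[Entity, str, Entity]


def entity(etype: str, *fragments: Span) -> Entity:
    """Build a hashable entity from its type and one or more span fragments."""
    return (etype, frozenset(fragments))


def micro_prf(gold: Set[Relation], pred: Set[Relation]) -> Tuple[float, float, float]:
    """Micro precision, recall, and F1 under strict (exact) matching of full triples."""
    tp = len(gold & pred)  # a triple is correct only if every component matches exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only; they are assumptions, not taken from the dataset guidelines.
disease = entity("DISEASE", (0, 12))
sign_a = entity("SIGN", (30, 38))
# Discontinuous gold entity made of two fragments.
sign_b = entity("SIGN", (45, 52), (60, 68))
# The predicted entity recovers only the first fragment (a partial match).
sign_b_partial = entity("SIGN", (45, 52))

gold = {(disease, "produces", sign_a), (disease, "produces", sign_b)}
pred = {(disease, "produces", sign_a), (disease, "produces", sign_b_partial)}

print(micro_prf(gold, pred))  # prints (0.5, 0.5, 0.5)
```

In this toy case the relation label and head entity are predicted correctly in both triples, yet the missed fragment of the discontinuous entity turns one prediction into both a false positive and a false negative. This is how an NER error can propagate into the end-to-end score under strict matching.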