AI Summary
This study addresses the absence of publicly available named entity recognition (NER) datasets for prion diseases, which has hindered clinical information extraction from biomedical literature. To bridge this gap, the authors introduce PrionNER, to their knowledge the first fine-grained, non-flat clinical NER dataset specifically designed for prion diseases, encompassing 15 coarse-grained and 31 fine-grained entity types. The dataset is built from 317 manually annotated PubMed abstracts, and inter-annotator agreement reaches an exact-match F1 of 81.78. The work further establishes supervised baselines (BERT, W2NER) and zero-shot baselines (such as Gemma). PrionNER addresses a gap in low-resource rare-disease natural language processing and provides a benchmark resource for future research in this domain.
Abstract
Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.
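As a rough illustration of what an exact-match F1 between two annotators could compute, the sketch below counts an entity as agreed only when start offset, end offset, and label are all identical. It is not the authors' evaluation script: the `(start, end, label)` span representation, the function name `exact_match_f1`, and the label strings are assumptions made for this example. Because spans are compared as a set, nested or overlapping (non-flat) mentions need no special handling.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, label).
Span = Tuple[int, int, str]


def exact_match_f1(ann_a: Set[Span], ann_b: Set[Span]) -> float:
    """Exact-match F1 between two annotation sets.

    A span counts as a match only if offsets and label are identical.
    Treating either annotator as the reference gives the same F1.
    """
    if not ann_a and not ann_b:
        return 1.0  # convention chosen here: two empty sets agree fully
    matched = len(ann_a & ann_b)
    if matched == 0:
        return 0.0
    precision = matched / len(ann_b)
    recall = matched / len(ann_a)
    return 2 * precision * recall / (precision + recall)


# Toy data with made-up offsets and labels; annotator A marks a nested span.
annotator_a = {(0, 25, "Disease"), (12, 25, "Disease"), (40, 48, "Symptom")}
annotator_b = {(0, 25, "Disease"), (40, 48, "Finding"), (60, 63, "Diagnostic")}

score = exact_match_f1(annotator_a, annotator_b)
```

In this toy case only one of three spans per annotator matches exactly, so the second pair (same offsets, different label) counts as a disagreement; a relaxed variant could instead score overlap or ignore labels.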