🤖 AI Summary
Medical chief complaint texts exhibit high lexical variability and lack annotated data, which hinders terminology standardization. To address this, the authors propose a weakly supervised framework for entity extraction and ontology linking. The approach uses a "split-and-match" algorithm to generate weak supervision signals (entity mention spans and class labels) automatically, removing the need for manual annotation, and then trains a BERT-based model to detect mentions and link them to concepts in a pre-defined ontology. Weak labels are generated on 1.2 million real-world chief complaint records, and the method is reported to outperform previous methods without any human annotation. The result is a scalable approach to clinical natural language processing tasks that require consistent normalization of medical terminology.
📝 Abstract
A chief complaint (CC) is the reason for the medical visit as stated in the patient's own words. It helps medical professionals quickly understand a patient's situation and also serves as a short summary for medical text mining. However, chief complaint records are entered in many different ways, which produces wide variation in medical notation and makes the records difficult to standardize across medical institutions for record keeping or text mining. In this study, we propose a weakly supervised method to automatically extract and link entities in chief complaints in the absence of human annotation. We first adopt a split-and-match algorithm to produce weak annotations, including entity mention spans and class labels, on 1.2 million real-world, de-identified, IRB-approved chief complaint records. We then train a BERT-based model on the generated weak labels to locate entity mentions in chief complaint text and link them to a pre-defined ontology. Extensive experiments show that our weakly supervised entity extraction and linking method outperforms previous methods without any human annotation.
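The abstract does not spell out the split-and-match algorithm, so the following is only a minimal sketch of how such a weak-labeling step might look, assuming that complaints are separated by simple delimiters and that a segment is labeled when it exactly matches an ontology term. The toy `ONTOLOGY`, the `SPLIT_PATTERN` delimiters, and the `split_and_match` function name are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical toy ontology: surface form -> concept label (illustrative only).
ONTOLOGY = {
    "chest pain": "CHEST_PAIN",
    "shortness of breath": "DYSPNEA",
    "sob": "DYSPNEA",
    "fever": "FEVER",
}

# Assumed delimiters between individual complaints; the real ones may differ.
SPLIT_PATTERN = re.compile(r"[,;/]|\band\b", flags=re.IGNORECASE)


def split_and_match(text, ontology=ONTOLOGY):
    """Return weak annotations as (start, end, mention, label) tuples."""
    annotations = []
    pos = 0
    # Walk over every delimiter, plus a final sentinel for the last segment.
    for delim in list(SPLIT_PATTERN.finditer(text)) + [None]:
        seg_end = delim.start() if delim else len(text)
        segment = text[pos:seg_end]
        mention = segment.strip()
        if mention:
            # Character offsets of the mention inside the original text.
            start = pos + (len(segment) - len(segment.lstrip()))
            end = start + len(mention)
            # "Match" step: exact, case-insensitive lookup in the ontology.
            label = ontology.get(mention.lower())
            if label is not None:
                annotations.append((start, end, mention, label))
        pos = delim.end() if delim else len(text)
    return annotations


weak_labels = split_and_match("Chest pain, SOB and fever x2 days")
```

In this sketch, segments that do not match any ontology term (such as "fever x2 days") are simply left unlabeled; the resulting spans and labels are the kind of weak annotation that could then be used to train a BERT-based tagger, which would be expected to generalize beyond exact matches.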