Mining Unstructured Medical Texts With Conformal Active Learning

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Accurately identifying symptoms in unstructured electronic health record (EHR) text for real-time epidemiological surveillance remains challenging because of data sparsity, annotation scarcity, and stringent latency and privacy requirements. Method: We propose a lightweight, interpretable text-mining framework that integrates conformal prediction with active learning, a novel use of conformal prediction in active learning for medical text. The framework employs explainable models (e.g., logistic regression, SVM), handcrafted textual features, and uncertainty-based sampling. Contribution/Results: With only 200 labeled samples, it achieves robust classification, matching or surpassing deep learning models on complex disease classification tasks. Deployable on edge devices, it reduces inference latency by 70% and improves annotation efficiency by over 5×, while preserving local data privacy through decentralized annotation and model training.

📝 Abstract
The extraction of relevant data from Electronic Health Records (EHRs) is crucial to identifying symptoms and automating epidemiological surveillance processes. By harnessing the vast amount of unstructured text in EHRs, we can detect patterns that indicate the onset of disease outbreaks, enabling faster, more targeted public health responses. Our proposed framework provides a flexible and efficient solution for mining data from unstructured texts, significantly reducing the need for extensive manual labeling by specialists. Experiments show that our framework achieves strong performance with as few as 200 manually labeled texts, even for complex classification problems. Additionally, our approach can function with simple, lightweight models, achieving results that are competitive with, and occasionally better than, those of more resource-intensive deep learning models. This capability not only accelerates processing times but also preserves patient privacy, as the data can be processed on weaker on-site hardware rather than being transferred to external systems. Our methodology, therefore, offers a practical, scalable, and privacy-conscious approach to real-time epidemiological monitoring, equipping health institutions to respond rapidly and effectively to emerging health threats.
Problem

Research questions and friction points this paper is trying to address.

Extracting data from unstructured EHRs
Reducing manual labeling in medical text analysis
Enabling real-time epidemiological monitoring with privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal Active Learning framework
Reduced manual labeling needs
Lightweight models for privacy
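
To make the idea of combining conformal prediction with active learning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, since this page does not describe their algorithm in detail. It assumes a generic split-conformal setup: a lightweight classifier has already produced class probabilities, a small labeled calibration set is available, and unlabeled texts whose prediction set is empty or contains several classes are treated as the most informative ones to send to an annotator. The function names (conformal_threshold, prediction_set, select_queries) and the choice of nonconformity score (1 minus the probability of the true class) are illustrative assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from a labeled calibration set.

    Assumed nonconformity score: 1 - probability of the true class.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every class.
        return 1.0
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Classes whose nonconformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


def select_queries(pool_probs, qhat, budget):
    """Pick the unlabeled items to annotate next (hypothetical rule).

    Items whose prediction set size is far from 1 (empty or ambiguous)
    rank first; ties are broken by the lowest top-class probability.
    """
    def uncertainty(item):
        _, probs = item
        size = len(prediction_set(probs, qhat))
        return (abs(size - 1), -max(probs))

    ranked = sorted(enumerate(pool_probs), key=uncertainty, reverse=True)
    return [idx for idx, _ in ranked[:budget]]


# Toy usage with made-up probabilities for a two-class problem.
cal_probs = [[0.875, 0.125], [0.5, 0.5], [0.25, 0.75]]
cal_labels = [0, 1, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)

pool_probs = [[0.875, 0.125], [0.5, 0.5], [0.75, 0.25]]
to_label = select_queries(pool_probs, qhat, budget=2)
```

In a full loop, the selected texts would be labeled by a specialist, added to the training set, and the lightweight model refit before the next round. The threshold only carries the usual conformal coverage guarantee when the calibration data are exchangeable with the texts being scored, which active selection can violate, so a real system would need to handle calibration with care.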
Juliano Genari
Escola de Matemática Aplicada, Fundação Getúlio Vargas, Rio de Janeiro, Rio de Janeiro, Brazil
Guilherme Tegoni Goedert
Professor, School for Applied Mathematics at FGV
Fluid Dynamics, Turbulence, Epidemiology, Agent-based Models