🤖 AI Summary
Current HIV screening relies heavily on structured EHR data, overlooking critical risk signals embedded in unstructured clinical notes. Method: We propose an LLM-based automated framework for HIV risk identification that uses a large language model to semantically parse and assess unstructured clinical text from Erasmus University Medical Center’s EHRs, integrated with a rule engine to form an end-to-end screening pipeline. Contribution/Results: This work represents the first systematic application of LLMs to early HIV screening, significantly enhancing detection of latent risk indicators, including symptom narratives, behavioral histories, and referral cues. In empirical evaluation, the pipeline attains high accuracy (AUC = 0.92) while maintaining a false-negative rate below 1%, and it improves case coverage by 37% over conventional structured-data approaches, indicating strong potential for clinical deployment.
📝 Abstract
Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large-scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily applies machine learning methods to structured data, such as patient demographics, to improve HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which may contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false-negative rate.