🤖 AI Summary
This study addresses critical challenges in deploying large language models (LLMs) for HIV clinical decision support: insufficient accuracy, poorly characterized safety risks, and low clinician acceptance. To this end, we introduce HIVMedQA, an open-ended question-answering benchmark designed specifically for HIV management, with questions co-developed with clinicians to reflect authentic, complex clinical scenarios. We propose a dual-perspective automated evaluation framework that combines lexical similarity metrics with LLM-as-a-judge scoring to systematically expose differences across models in reasoning capability, cognitive bias, and safety compliance. Using medical knowledge-informed prompt engineering, we find that Gemini 2.5 Pro achieves the strongest overall performance; notably, neither domain-specific fine-tuning nor parameter count is a decisive determinant of success. Models exhibit marked performance degradation on complex queries, along with pervasive cognitive biases. Our work establishes a paradigm for trustworthy, clinically grounded evaluation and adaptation of LLMs in specialized medical domains.
📝 Abstract
Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension proved more challenging than factual recall, and cognitive biases such as recency bias and status quo bias were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.