🤖 AI Summary
This study addresses safety risks in currently deployed medical large language models (MedGPTs) that have not been systematically evaluated: hallucinations, policy violations, and insufficient privacy disclosures. We propose MedGPT-HEval, the first hallucination detection framework tailored for medical LLMs, alongside an LLM-based pipeline for assessing policy compliance. We conduct a large-scale assessment of 6,233 online MedGPT instances, evaluating a stratified sample of 1,500 of them together with 10 open-source models. Using an automated evaluation pipeline, semantic alignment analysis, and multidimensional safety metrics, we find that 25–30% of MedGPTs exhibit low factual accuracy, 33.6–54.3% violate operational thresholds, and 57.06% of Action-enabled MedGPTs lack adequate privacy disclosures. To support further research, we release HAA-MedGPT, a structured dataset for studying the safety of web-facing medical LLMs.
📝 Abstract
Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25–30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6–54.3% violate operational thresholds; and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.