Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

📅 2026-05-19

🤖 AI Summary
This study addresses safety risks in web-deployed medical large language models (MedGPTs) that have not been systematically evaluated: hallucination, policy violations, and insufficient privacy disclosures. We propose MedGPT-HEval, a hallucination detection framework for medical LLMs, alongside an LLM-based pipeline for assessing policy violations and developer intent. From a population of 6,233 online MedGPTs, we evaluate a stratified sample of 1,500, together with 10 open-source models. Using automated evaluation, semantic alignment analysis, and multiple safety metrics, we find that 25–30% of MedGPTs exhibit low factual accuracy, 33.6–54.3% violate operational thresholds, and 57.06% of Action-enabled MedGPTs lack adequate privacy disclosures. We release HAA-MedGPT, a structured dataset intended to support further research on the safety of web-facing medical LLMs.
📝 Abstract
Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
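The abstract says 1,500 of the 6,233 MedGPTs were drawn as a stratified sample but does not state the stratification variable. The sketch below illustrates one common approach, proportional allocation with largest-remainder rounding. The tier labels and per-tier counts are hypothetical placeholders, not figures from the paper, and the function name `stratified_sample` is introduced here purely for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, key, total, seed=0):
    """Draw `total` items, giving each stratum a share proportional to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    n = len(items)
    # Ideal (fractional) quota per stratum, then round down.
    exact = {s: total * len(members) / n for s, members in strata.items()}
    quota = {s: int(q) for s, q in exact.items()}

    # Largest-remainder rule: hand the leftover slots to the strata
    # whose fractional parts are largest, so quotas sum to `total`.
    leftover = total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, quota[s]))  # without replacement
    return sample

# Hypothetical population: 6,233 records split into three made-up tiers.
tier_sizes = {"top": 1000, "middle": 2233, "bottom": 3000}
population = [
    {"id": f"{tier}-{i}", "tier": tier}
    for tier, size in tier_sizes.items()
    for i in range(size)
]

sample = stratified_sample(population, key=lambda r: r["tier"], total=1500)
per_tier = Counter(r["tier"] for r in sample)
```

Proportional allocation keeps each stratum's share in the sample close to its share in the population, so per-tier rates can be compared without reweighting. Other allocation schemes, such as equal per-stratum quotas, are equally plausible readings of the abstract.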
Problem

Research questions and friction points this paper is trying to address.

hallucination
policy violation
medical LLMs
privacy disclosure
factual accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination detection
policy compliance
medical LLM evaluation
MedGPT-HEval
web-deployed LLM safety