Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses safety risks in web-deployed medical large language models, in particular custom medical GPTs (MedGPTs): hallucination, policy violations, and inadequate privacy disclosures, none of which have been evaluated systematically. We propose MedGPT-HEval, a hallucination detection framework tailored to medical LLMs, alongside an LLM-based pipeline for assessing policy violations and developer intent. From a population of 6,233 online MedGPTs, we evaluate a stratified sample of 1,500 together with 10 open-source models. Using an automated evaluation pipeline, semantic alignment analysis, and multiple safety metrics, we find that 25–30% of MedGPTs exhibit low factual accuracy, 33.6–54.3% violate operational thresholds, and 57.06% of Action-enabled MedGPTs lack adequate privacy disclosures. We release HAA-MedGPT, a structured dataset intended to support further research on the safety of web-facing medical LLMs.

📝 Abstract

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
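The abstract does not say how the stratified sample of 1,500 was drawn from the 6,233 MedGPTs. As a rough illustration only, the sketch below shows one common approach, proportional allocation with largest-remainder rounding. The `tier` field, the three tier names, and the function `stratified_sample` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, total, seed=0):
    """Draw exactly `total` items, allocating to each stratum in proportion to its size."""
    if total > len(items):
        raise ValueError("total exceeds population size")
    rng = random.Random(seed)

    # Group the population by stratum key.
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    # Proportional quotas, rounded down, then the leftover slots go to the
    # strata with the largest fractional remainders so quotas sum to `total`.
    n = len(items)
    exact = {k: total * len(v) / n for k, v in strata.items()}
    quota = {k: int(x) for k, x in exact.items()}
    leftover = total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, quota[k]))
    return sample


# Hypothetical population: 6,233 entries spread evenly over three made-up tiers.
population = [
    {"id": i, "tier": ("top", "middle", "bottom")[i % 3]} for i in range(6233)
]
sample = stratified_sample(population, lambda g: g["tier"], total=1500)
print(len(sample))  # 1500
```

Proportional allocation keeps each tier's share of the sample close to its share of the population, so per-tier rates (such as the fraction with low factual accuracy) can be compared without reweighting. The paper may have used a different allocation scheme.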

Problem

Research questions and friction points this paper is trying to address.

hallucination

policy violation

medical LLMs

privacy disclosure

factual accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination detection

policy compliance

medical LLM evaluation