Benchmarking Ethical and Safety Risks of Healthcare LLMs in China: Toward Systemic Governance under Healthy China 2030

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Amid China’s “Healthy China 2030” initiative, large medical language models (LMLMs) urgently require systematic governance of their ethical and safety risks. Method: We construct the first comprehensive Chinese benchmark for evaluating LMLMs, comprising 12,000 items spanning 11 ethical and 9 safety dimensions, and conduct zero-shot and fine-tuned evaluations with a dual-axis ethical–safety analysis. Contribution/Results: Our evaluation reveals a low baseline accuracy of 42.7% across mainstream models (improving to 50.8% after fine-tuning) and identifies critical institutional gaps, including absent ethical auditing and delayed Institutional Review Board (IRB) responsiveness. We propose a novel three-tier collaborative governance framework that integrates embedded audit teams, data ethics guidelines, and safety simulation pipelines, enabling institutional-level process modeling and scalable risk management. This work establishes both an evaluation standard and a governance paradigm for the safe, responsible deployment of domestically developed medical LLMs.

📝 Abstract
Large Language Models (LLMs) are poised to transform healthcare under China's Healthy China 2030 initiative, yet they introduce new ethical and patient-safety challenges. We present a novel 12,000-item Q&A benchmark covering 11 ethics and 9 safety dimensions in medical contexts, to quantitatively evaluate these risks. Using this dataset, we assess state-of-the-art Chinese medical LLMs (e.g., Qwen 2.5-32B, DeepSeek), revealing moderate baseline performance (accuracy 42.7% for Qwen 2.5-32B) and significant improvements after fine-tuning on our data (up to 50.8% accuracy). Results show notable gaps in LLM decision-making on ethics and safety scenarios, reflecting insufficient institutional oversight. We then identify systemic governance shortfalls, including the lack of fine-grained ethical audit protocols, slow adaptation by hospital IRBs, and insufficient evaluation tools, that currently hinder safe LLM deployment. Finally, we propose a practical governance framework for healthcare institutions (embedding LLM auditing teams, enacting data ethics guidelines, and implementing safety simulation pipelines) to proactively manage LLM risks. Our study highlights the urgent need for robust LLM governance in Chinese healthcare, aligning AI innovation with patient safety and ethical standards.
Problem

Research questions and friction points this paper is trying to address.

Assessing ethical and safety risks of healthcare LLMs in China
Evaluating performance gaps in medical LLM decision-making
Proposing governance framework for safe LLM deployment in healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed a 12,000-item Q&A benchmark covering ethics and safety dimensions
Evaluated Chinese medical LLMs in zero-shot and fine-tuned settings
Proposed governance framework with auditing and safety protocols
Mouxiao Bian
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Rongzhao Zhang
Shanghai AI Lab
Chao Ding
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Xinwei Peng
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Jie Xu
Shanghai Artificial Intelligence Laboratory, Shanghai, China