Towards Assessing Medical Ethics from Knowledge to Practice

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical AI ethics evaluation benchmarks largely overlook large language models’ (LLMs) capacity to dynamically apply core ethical principles. Method: We introduce PrinciplismQA—the first benchmark grounded in the four-principle framework (autonomy, beneficence, non-maleficence, justice)—integrating authoritative textbook knowledge and real-world clinical cases. It explicitly distinguishes ethical knowledge acquisition from practical reasoning ability, employing expert-validated multiple-choice and open-ended questions for standardized assessment. Contribution/Results: Experiments reveal a pervasive “knowing-doing gap”: mainstream LLMs exhibit significantly weaker ethical application than factual recall. Closed-source frontier models outperform open-source counterparts; medical domain fine-tuning improves overall performance but fails to resolve systematic biases in principle trade-offs and beneficence judgments. PrinciplismQA establishes a reproducible, decomposable evaluation paradigm for LLM ethical alignment in healthcare.

📝 Abstract
The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
Problem

Research questions and friction points this paper is trying to address.

Assessing ethical reasoning of LLMs in healthcare
Evaluating alignment with core medical ethics principles
Bridging gap between ethical knowledge and practical application
Innovation

Methods, ideas, or system contributions that make the work stand out.

PrinciplismQA benchmark assesses LLM ethics
Combines multiple-choice and open-ended questions
Medical fine-tuning improves ethical alignment
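To make the benchmark's decomposable evaluation concrete, here is a minimal sketch of how a PrinciplismQA-style multiple-choice split might be scored per principle and per question type, so that a "knowing-doing gap" (knowledge vs. application accuracy) becomes visible. The dataset schema (`principle`, `type`, `answer` fields) and the `predict` callback are assumptions for illustration, not the paper's actual format.

```python
# Hypothetical sketch: per-principle, per-type accuracy for an MCQ split.
# The item schema below is assumed, not taken from the paper.
from collections import defaultdict

def score_by_principle(items, predict):
    """items: dicts with 'question', 'options', 'answer', 'principle',
    and 'type' ('knowledge' or 'application');
    predict: fn(question, options) -> chosen option key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        key = (it["principle"], it["type"])
        total[key] += 1
        if predict(it["question"], it["options"]) == it["answer"]:
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}

# Toy usage with a trivial "model" that always answers "A":
items = [
    {"question": "q1", "options": {"A": "...", "B": "..."}, "answer": "A",
     "principle": "beneficence", "type": "knowledge"},
    {"question": "q2", "options": {"A": "...", "B": "..."}, "answer": "B",
     "principle": "beneficence", "type": "application"},
]
scores = score_by_principle(items, lambda q, opts: "A")
```

Comparing the `knowledge` and `application` accuracies for the same principle is one simple way to quantify the knowing-doing gap the paper reports.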
Chang Hong
The Chinese University of Hong Kong, Shenzhen
Minghao Wu
The Chinese University of Hong Kong, Shenzhen
Qingying Xiao
National Health Data Institute, Shenzhen
Yuchi Wang
CUHK MMLab; Peking University
Multimodality · VLM · Generative Models
Xiang Wan
Shenzhen Research Institute of Big Data
Bioinformatics · Data Mining · Big Data Analysis
Guangjun Yu
The Chinese University of Hong Kong, Shenzhen; National Health Data Institute, Shenzhen
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Large Language Models · Natural Language Processing · Information Retrieval · Applied Machine Learning
Yan Hu
The Chinese University of Hong Kong, Shenzhen