🤖 AI Summary
Existing benchmarks for medical AI ethics largely overlook large language models' (LLMs') capacity to dynamically apply core ethical principles. Method: We introduce PrinciplismQA, the first benchmark grounded in the four-principle framework of Principlism (autonomy, beneficence, non-maleficence, justice), integrating authoritative textbook knowledge with real-world clinical cases. It explicitly separates ethical knowledge acquisition from practical reasoning ability, using expert-validated multiple-choice and open-ended questions for standardized assessment. Contribution/Results: Experiments reveal a pervasive "knowing-doing gap": mainstream LLMs show markedly weaker ethical application than factual recall. Closed-source frontier models outperform open-source counterparts; medical-domain fine-tuning improves overall performance but does not resolve systematic biases in principle trade-offs and beneficence judgments. PrinciplismQA establishes a reproducible, decomposable evaluation paradigm for LLM ethical alignment in healthcare.
📝 Abstract
The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark of 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, the benchmark pairs multiple-choice questions curated from authoritative textbooks with open-ended questions drawn from the medical ethics case-study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and its practical application, especially when ethical principles must be applied dynamically to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical-domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
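
To make the "decomposable" evaluation concrete, here is a minimal Python sketch of how such benchmark items might be represented and scored per principle. The schema, class names, and example question are illustrative assumptions for this page, not the paper's released data format; open-ended application items would additionally need expert- or rubric-based grading, which is omitted here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Principle(Enum):
    # The four Principlism principles the benchmark is organized around.
    AUTONOMY = "autonomy"
    BENEFICENCE = "beneficence"
    NON_MALEFICENCE = "non-maleficence"
    JUSTICE = "justice"

@dataclass
class BenchmarkItem:
    """Hypothetical schema for one PrinciplismQA-style item."""
    item_id: str
    principle: Principle                      # principle the item primarily targets
    question: str
    # Knowledge items: multiple-choice with a single keyed answer.
    options: Optional[list[str]] = None
    answer_index: Optional[int] = None
    # Application items: open-ended case with an expert reference analysis.
    reference_analysis: Optional[str] = None

    @property
    def is_knowledge_item(self) -> bool:
        return self.options is not None

def score_knowledge(item: BenchmarkItem, predicted_index: int) -> float:
    """Exact-match accuracy for a multiple-choice (knowledge) item."""
    assert item.is_knowledge_item, "open-ended items need rubric-based grading"
    return 1.0 if predicted_index == item.answer_index else 0.0

def aggregate_by_principle(scores: list[tuple[Principle, float]]) -> dict[str, float]:
    """Average scores per principle to surface imbalances (e.g., weak Beneficence)."""
    buckets: dict[str, list[float]] = {}
    for principle, s in scores:
        buckets.setdefault(principle.value, []).append(s)
    return {p: sum(v) / len(v) for p, v in buckets.items()}

if __name__ == "__main__":
    item = BenchmarkItem(
        item_id="mcq-001",
        principle=Principle.AUTONOMY,
        question="Which principle is most directly engaged by informed consent?",
        options=["Autonomy", "Beneficence", "Non-maleficence", "Justice"],
        answer_index=0,
    )
    print(score_knowledge(item, predicted_index=0))           # 1.0
    print(aggregate_by_principle([(item.principle, 1.0)]))    # {'autonomy': 1.0}
```

Keeping knowledge (multiple-choice) and application (open-ended) items in one schema but scoring them separately is what allows the "knowing-doing gap" and per-principle weaknesses to be read off directly from the aggregated results.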