🤖 AI Summary
Existing benchmarks for medical AI ethics largely overlook large language models' (LLMs') capacity to dynamically apply core ethical principles. Method: We introduce PrinciplismQA, the first benchmark grounded in the four-principle framework of Principlism (autonomy, beneficence, non-maleficence, justice), integrating authoritative textbook knowledge with real-world clinical cases. It explicitly separates ethical knowledge acquisition from practical reasoning ability, using expert-validated multiple-choice and open-ended questions for standardized assessment. Contribution/Results: Experiments reveal a pervasive "knowing-doing gap": mainstream LLMs show markedly weaker ethical application than factual recall. Closed-source frontier models outperform open-source counterparts; medical-domain fine-tuning improves overall performance but does not resolve systematic biases in principle trade-offs and beneficence judgments. PrinciplismQA establishes a reproducible, decomposable evaluation paradigm for LLM ethical alignment in healthcare.
📝 Abstract
The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark of 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, the benchmark pairs multiple-choice questions curated from authoritative textbooks with open-ended questions drawn from the medical ethics case-study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and its practical application, especially when ethical principles must be applied dynamically to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical-domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
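
To make the "decomposable" evaluation concrete, here is a minimal Python sketch of how such benchmark items might be represented and scored per principle. The schema, class names, and example question are illustrative assumptions for this page, not the paper's released data format; open-ended application items would additionally need expert- or rubric-based grading, which is omitted here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Principle(Enum):
    # The four Principlism principles the benchmark is organized around.
    AUTONOMY = "autonomy"
    BENEFICENCE = "beneficence"
    NON_MALEFICENCE = "non-maleficence"
    JUSTICE = "justice"

@dataclass
class BenchmarkItem:
    """Hypothetical schema for one PrinciplismQA-style item."""
    item_id: str
    principle: Principle                      # principle the item primarily targets
    question: str
    # Knowledge items: multiple-choice with a single keyed answer.
    options: Optional[list[str]] = None
    answer_index: Optional[int] = None
    # Application items: open-ended case with an expert reference analysis.
    reference_analysis: Optional[str] = None

    @property
    def is_knowledge_item(self) -> bool:
        return self.options is not None

def score_knowledge(item: BenchmarkItem, predicted_index: int) -> float:
    """Exact-match accuracy for a multiple-choice (knowledge) item."""
    assert item.is_knowledge_item, "open-ended items need rubric-based grading"
    return 1.0 if predicted_index == item.answer_index else 0.0

def aggregate_by_principle(scores: list[tuple[Principle, float]]) -> dict[str, float]:
    """Average scores per principle to surface imbalances (e.g., weak Beneficence)."""
    buckets: dict[str, list[float]] = {}
    for principle, s in scores:
        buckets.setdefault(principle.value, []).append(s)
    return {p: sum(v) / len(v) for p, v in buckets.items()}

if __name__ == "__main__":
    item = BenchmarkItem(
        item_id="mcq-001",
        principle=Principle.AUTONOMY,
        question="Which principle is most directly engaged by informed consent?",
        options=["Autonomy", "Beneficence", "Non-maleficence", "Justice"],
        answer_index=0,
    )
    print(score_knowledge(item, predicted_index=0))           # 1.0
    print(aggregate_by_principle([(item.principle, 1.0)]))    # {'autonomy': 1.0}
```

Keeping knowledge (multiple-choice) and application (open-ended) items in one schema but scoring them separately is what allows the "knowing-doing gap" and per-principle weaknesses to be read off directly from the aggregated results.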