🤖 AI Summary
This work addresses the absence of computable, evaluable notions of moral alignment in current AI systems confronted with hierarchical and potentially conflicting human ethical norms. The authors propose Morality Chains, a formal framework that models hierarchical moral rules as ordered deontic constraints, and introduce MoralityGym, a Gymnasium-based benchmark of 98 ethically challenging scenarios that decouples the evaluation of task performance from moral judgment. They also develop a Morality Metric, grounded in insights from psychology and philosophy, to quantify an agent's moral reasoning in sequential decision-making. Experimental results reveal significant limitations of existing safe reinforcement learning approaches on complex moral reasoning, laying a foundation for reliable, transparent, and ethically aligned AI systems.
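The paper's exact formalism is not reproduced here, but a minimal sketch of what "hierarchical moral rules as ordered deontic constraints" could look like in code is given below. The names `Deontic`, `Constraint`, and `first_violation` are illustrative assumptions, not the authors' API; the key idea shown is that constraints earlier in the chain dominate later ones.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Deontic(Enum):
    """Deontic modality of a single moral rule."""
    OBLIGATION = "obligation"    # the predicate must hold
    PROHIBITION = "prohibition"  # the predicate must not hold


@dataclass(frozen=True)
class Constraint:
    """One rule in a morality chain; earlier chain positions outrank later ones."""
    name: str
    modality: Deontic
    predicate: Callable[[object, object], bool]  # evaluated on a (state, action) pair


def first_violation(
    chain: Sequence[Constraint],
    trajectory: Sequence[Tuple[object, object]],
) -> Optional[Constraint]:
    """Return the highest-priority constraint violated anywhere on a trajectory.

    Because the chain is ordered, violating chain[0] is judged worse than any
    number of violations of chain[1], and so on down the hierarchy.
    """
    for constraint in chain:  # scan in priority order
        for state, action in trajectory:
            holds = constraint.predicate(state, action)
            if (constraint.modality is Deontic.OBLIGATION and not holds) or (
                constraint.modality is Deontic.PROHIBITION and holds
            ):
                return constraint
    return None
```

Under this reading, a chain such as `[do_not_harm, keep_promises]` would rank any harm as strictly worse than any broken promise, which is the kind of lexical ordering a trolley-style dilemma is designed to probe.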
📝 Abstract
Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym makes it possible to integrate insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
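Since MoralityGym builds on Gymnasium, evaluation presumably follows the standard environment loop, with moral signals reported separately from the task reward. Below is a hedged sketch of that decoupling; the environment id `"MoralityGym/TrolleyJunction-v0"` and the `"moral_info"` info key are assumptions for illustration, while the `gym.make`/`reset`/`step` calls are standard Gymnasium API.

```python
import gymnasium as gym

# Assumed env id for illustration; the real registration names are
# defined by the MoralityGym benchmark itself.
env = gym.make("MoralityGym/TrolleyJunction-v0")

obs, info = env.reset(seed=0)
task_return, moral_log = 0.0, []
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # stand-in for a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
    task_return += reward                      # task performance is scored here...
    moral_log.append(info.get("moral_info"))   # ...while moral judgment is tracked separately
env.close()
```

Keeping the moral record out of the reward stream is what lets a benchmark score an agent that solves the task while violating a norm differently from one that sacrifices task return to respect it.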