Toward Virtuous Reinforcement Learning

📅 2025-12-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current RL ethics research faces two key bottlenecks. Deontological approaches adapt poorly in non-stationary environments and fail to cultivate stable moral dispositions, while single-objective reward schemes compress pluralistic values into one scalar, invite reward hacking, and obscure ethical trade-offs. This paper proposes the first virtue-oriented RL framework, modeling ethics as stable, learnable, and updatable character dispositions at the policy level. Methodologically, it integrates multi-agent social learning, multi-objective constrained optimization, affinity-based regularization toward virtue priors, and explicit cross-cultural ethical traditions as control signals. The framework improves agents' moral robustness under distributional shift and human intervention, enhances value-evolution capability and decision transparency, supports virtue retention across contexts, mitigates reward hacking, and strengthens the capacity to express and coordinate complex value conflicts.

πŸ“ Abstract
This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue-focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule-based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and non-stationarity and do not cultivate lasting habits, and (ii) many reward-based approaches, especially single-objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade-offs and invite proxy gaming in practice. We instead treat ethics as policy-level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade-offs. Our roadmap combines four components: (1) social learning in multi-agent RL to acquire virtue-like patterns from imperfect but normatively informed exemplars; (2) multi-objective and constrained formulations that preserve value conflicts and incorporate risk-aware criteria to guard against harm; (3) affinity-based regularization toward updatable virtue priors that support trait-like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.
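Two of the abstract's ideas can be made concrete in a few lines: keeping the reward vector-valued rather than collapsing it into one scalar, and penalizing a policy's drift from a virtue prior with a divergence-based affinity term. The sketch below is illustrative only, assuming a discrete action space; names such as `affinity_regularized_objective` and the KL choice for the affinity term are assumptions, not details from the paper.

```python
import numpy as np

def affinity_regularized_objective(
    policy_probs: np.ndarray,   # pi(a|s), shape (n_actions,)
    prior_probs: np.ndarray,    # virtue prior pi_v(a|s), shape (n_actions,)
    reward_matrix: np.ndarray,  # per-objective rewards, shape (n_objectives, n_actions)
    beta: float = 0.1,
) -> tuple[np.ndarray, float]:
    """Return per-objective expected returns plus an affinity penalty.

    The returns stay a vector (one entry per moral objective), so
    trade-offs remain visible instead of being scalarized away.
    """
    # Expected return for each objective under the current policy.
    per_objective_return = reward_matrix @ policy_probs  # shape (n_objectives,)
    # Affinity term: KL(pi || pi_v) measures drift from the virtue prior;
    # beta controls how strongly the policy is pulled back toward it.
    kl_penalty = float(np.sum(policy_probs * np.log(policy_probs / prior_probs)))
    return per_objective_return, beta * kl_penalty
```

Because the prior itself is an input, it can be updated as norms evolve while the penalty still confers trait-like stability at any point in time.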
Problem

Research questions and friction points this paper is trying to address.

Critiques limitations in current machine ethics for Reinforcement Learning
Proposes virtue-focused approach for stable ethical habits in RL
Roadmap includes multi-agent learning and multi-objective formulations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent social learning from imperfect exemplars
Multi-objective constrained formulations with risk-aware criteria
Affinity-based regularization for stable, updateable virtue priors
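The risk-aware criterion mentioned above could be instantiated in several ways; the sketch below uses CVaR (Conditional Value-at-Risk), a common tail-risk measure, as one illustrative choice. The function names and the constraint-check pattern are assumptions for exposition, not the paper's specification.

```python
import numpy as np

def cvar_harm(harm_samples: np.ndarray, alpha: float = 0.1) -> float:
    """Mean harm over the worst alpha fraction of sampled trajectories.

    Focusing on the tail, rather than average harm, guards against
    policies that are safe on average but occasionally catastrophic.
    """
    sorted_harm = np.sort(harm_samples)[::-1]  # largest harm first
    k = max(1, int(np.ceil(alpha * len(harm_samples))))
    return float(sorted_harm[:k].mean())

def satisfies_harm_constraint(harm_samples: np.ndarray,
                              budget: float,
                              alpha: float = 0.1) -> bool:
    # A candidate policy is admissible only if its tail harm stays
    # within the allotted budget.
    return cvar_harm(harm_samples, alpha) <= budget
```

In a constrained multi-objective setup, a check like this would filter candidate policies before any comparison of their per-objective returns.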