🤖 AI Summary
Modeling non-time-separable objectives—such as creativity, imitation, fairness, and safety—in multi-agent reinforcement learning remains challenging due to their incompatibility with standard additive reward assumptions. Method: This paper introduces the Convex Markov Game (CMG) framework, which models agents’ general convex preferences over state-action occupancy measures, relaxing the traditional additivity requirement of Markov games. It establishes, for the first time, the existence of pure-strategy Nash equilibria in infinite-horizon CMGs. Furthermore, it proposes a gradient-descent-based approximation algorithm leveraging an upper bound on exploitability defined over the occupancy measure space. Contribution/Results: Experiments demonstrate that the method achieves high payoff and low exploitability in repeated Prisoner’s Dilemma; autonomously discovers fair solutions in asymmetric coordination games; and promotes safe long-term behavior in a robotic warehouse task—thereby unifying diverse behavioral objectives within a single, theoretically grounded framework.
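To make "convex preferences over occupancy measures" concrete, the sketch below (a toy two-state MDP whose transitions, initial distribution, and expert policy are all assumed purely for illustration, not taken from the paper) computes a policy's discounted state-action occupancy measure and evaluates a convex imitation objective on it — a KL divergence to an expert's occupancy measure — which, being nonlinear in the occupancy measure, cannot be written as an additive per-step reward:

```python
import numpy as np

gamma = 0.9
# Toy 2-state, 2-action MDP (assumed for illustration): P[s, a, s'] transition probs.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
mu0 = np.array([1.0, 0.0])  # initial state distribution

def occupancy(pi):
    """Discounted state-action occupancy measure rho(s, a) of policy pi(a|s)."""
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) * P[s, a, s'].
    P_pi = np.einsum('sa,sat->st', pi, P)
    # Discounted state visitation d solves d = (1 - gamma) * mu0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d[:, None] * pi  # rho(s, a) = d(s) * pi(a|s)

def kl(rho, rho_ref):
    """Convex imitation objective: KL divergence between occupancy measures."""
    return float(np.sum(rho * np.log(rho / rho_ref)))

expert = occupancy(np.array([[0.8, 0.2], [0.3, 0.7]]))
learner = occupancy(np.array([[0.5, 0.5], [0.5, 0.5]]))
print(kl(learner, expert))  # > 0; shrinks as the learner's occupancy approaches the expert's
```

An additive reward is a *linear* functional of the occupancy measure; this KL objective is convex but not linear in it, which is exactly the generalization the CMG framework targets.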
📝 Abstract
Behavioral diversity, expert imitation, fairness, safety, and other goals give rise to preferences in sequential decision-making domains that do not decompose additively across time. We introduce the class of convex Markov games, which allow general convex preferences over occupancy measures. Despite the infinite time horizon and strictly greater generality than Markov games, pure-strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal-form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long-term behavior in a robot warehouse environment. In the prisoner's dilemma, our algorithm leverages transient imitation to find a policy profile that deviates only slightly from observed human play, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
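The exploitability-descent idea can be illustrated on the one-shot prisoner's dilemma. This is a minimal sketch under simplifying assumptions (standard payoff values, softmax-parameterized mixed strategies, numerical gradients), not the paper's algorithm, which operates on occupancy measures of the sequential game: each player's incentive to deviate is the payoff of a unilateral best response minus the current payoff, and gradient descent on the summed incentives drives the profile toward an equilibrium.

```python
import numpy as np

# One-shot prisoner's dilemma payoffs for the row player; actions: 0 = cooperate, 1 = defect.
A = np.array([[3.0, 0.0], [5.0, 1.0]])
B = A.T  # symmetric game: column player's payoffs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exploitability(za, zb):
    """Sum of each player's gain from unilaterally best-responding to the other."""
    x, y = softmax(za), softmax(zb)
    gain_a = (A @ y).max() - x @ A @ y
    gain_b = (x @ B).max() - x @ B @ y
    return gain_a + gain_b

def num_grad(f, z, eps=1e-5):
    """Central-difference gradient of f at logits z."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (f(z + d) - f(z - d)) / (2 * eps)
    return g

za, zb = np.zeros(2), np.zeros(2)  # start from uniform mixed strategies
for _ in range(2000):
    ga = num_grad(lambda z: exploitability(z, zb), za)
    gb = num_grad(lambda z: exploitability(za, z), zb)
    za -= 0.1 * ga
    zb -= 0.1 * gb

print(softmax(za), exploitability(za, zb))  # converges toward mutual defection, near-zero exploitability
```

At an exact Nash equilibrium exploitability is zero, so its value along the descent is a direct measure of how far the current profile is from equilibrium; the paper's contribution is extending this quantity, via an upper bound, to convex preferences over occupancy measures.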