🤖 AI Summary
Problem: How can long-term reciprocal cooperation be induced among self-interested agents in sequential social dilemmas, given limited learning time and no knowledge of opponents' strategies?
Method: We propose the Reciprocator agent, which responds intrinsically to how an opponent’s actions affect its own returns; we design a reward-based reciprocity mechanism that requires neither opponent policy differentiation nor meta-game modeling; and we employ a Q-value shaping framework integrating counterfactual return estimation with dynamic reward re-scaling.
Contribution/Results: This work achieves, for the first time, learning-rule-agnostic and sample-efficient cooperation induction—stably attaining Pareto-optimal cooperation across multiple canonical social dilemmas (e.g., Iterated Prisoner’s Dilemma, Stag Hunt, Chicken) and significantly outperforming both independent learners and state-of-the-art opponent-shaping approaches.
📝 Abstract
Cooperation between self-interested individuals is a widespread phenomenon in the natural world, but remains elusive in interactions between artificially intelligent agents. Instead, naive reinforcement learning algorithms typically converge to Pareto-dominated outcomes in even the simplest of social dilemmas. An emerging literature on opponent shaping has demonstrated the ability to reach prosocial outcomes by influencing the learning of other agents. However, such methods differentiate through the learning step of other agents or optimize for meta-game dynamics, which rely on privileged access to opponents' learning algorithms or exponential sample complexity, respectively. To provide a learning rule-agnostic and sample-efficient alternative, we introduce Reciprocators, reinforcement learning agents which are intrinsically motivated to reciprocate the influence of opponents' actions on their returns. This approach seeks to modify other agents' $Q$-values by increasing their return following beneficial actions (with respect to the Reciprocator) and decreasing it after detrimental actions, guiding them towards mutually beneficial actions without directly differentiating through a model of their policy. We show that Reciprocators can be used to promote cooperation in temporally extended social dilemmas during simultaneous learning. Our code is available at https://github.com/johnlyzhou/reciprocator/.
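The core mechanism described above—rewarding an opponent in proportion to how its actions influenced the Reciprocator's own return, measured against a counterfactual baseline—can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: the function names (`reciprocal_reward`), the max-normalization used for dynamic re-scaling, and the simple additive shaping of the opponent's reward are all assumptions made for clarity.

```python
import numpy as np

def reciprocal_reward(actual_returns, counterfactual_returns, scale=1.0):
    """Intrinsic reciprocity signal (hypothetical sketch).

    Positive when the opponent's action improved our return relative to a
    counterfactual baseline (what we would have earned had the opponent
    acted differently), negative when it hurt us.
    """
    influence = np.asarray(actual_returns) - np.asarray(counterfactual_returns)
    # Dynamic re-scaling (illustrative): normalize the influence magnitude
    # so the intrinsic term does not dominate the environment reward.
    norm = np.abs(influence).max()
    if norm > 0:
        influence = influence / norm
    return scale * influence

# Shaping the opponent's effective reward: beneficial actions are paid back
# with extra reward, detrimental ones are penalized, nudging the opponent's
# Q-values toward mutually beneficial actions without differentiating
# through a model of its policy.
actual = [2.0, -1.0]          # our realized returns after opponent actions
counterfactual = [0.0, 0.0]   # baseline returns under a counterfactual action
shaping = reciprocal_reward(actual, counterfactual)
opponent_env_reward = np.array([1.0, 1.0])
opponent_shaped_reward = opponent_env_reward + shaping
```

Because the signal depends only on observed returns and a counterfactual estimate, it needs neither access to the opponent's learning algorithm nor meta-game rollouts, which is what makes the approach learning-rule-agnostic.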