🤖 AI Summary
This work addresses the limitations of existing contrastive reinforcement learning methods, which are confined to off-policy optimisation and mostly to continuous action spaces, leaving on-policy frameworks, discrete actions, and multi-agent scenarios without an effective self-supervised mechanism. The paper proposes Contrastive Proximal Policy Optimisation (CPPO), presented as the first approach to integrate contrastive learning directly with on-policy optimisation. CPPO learns Q-values from contrastive representations of state-action pairs and goals, derives policy advantages from those Q-values, and optimises them with the standard PPO objective, eliminating the need for a hand-crafted reward function or a replay buffer. The method supports both continuous and discrete action spaces as well as single-agent and cooperative multi-agent settings. Evaluated across 18 tasks, CPPO significantly outperforms existing contrastive RL baselines in 14 tasks and matches or exceeds the performance of PPO trained with hand-crafted dense rewards in 12.
📝 Abstract
Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from the widely used and effective modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds the performance of PPO, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.
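
The pipeline described in the abstract (contrastive Q-values, advantages derived from them, a PPO update) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the critic parameterisation, the contrastive loss, or how advantages are computed. The sketch assumes an inner-product critic, an InfoNCE loss with in-batch negatives, and, for discrete actions, an advantage equal to the Q-value minus its expectation under the current policy. All function names and the `clip_eps` default are hypothetical.

```python
import numpy as np


def contrastive_logits(sa_repr, goal_repr):
    """Pairwise critic values f(s_i, a_i, g_j) as inner products; shape (B, B).

    Assumption: the critic is an inner product of two learned embeddings.
    """
    return sa_repr @ goal_repr.T


def infonce_loss(logits):
    """InfoNCE with in-batch negatives (an assumed choice of contrastive loss).

    Row i treats goal i (the diagonal entry) as the positive and every
    other goal in the batch as a negative.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))


def discrete_advantage(q_values, policy_probs, actions):
    """Assumed advantage: A(s, a, g) = Q(s, a, g) - E_{a' ~ pi}[Q(s, a', g)].

    q_values, policy_probs: shape (B, num_actions); actions: shape (B,) ints.
    """
    baseline = (policy_probs * q_values).sum(axis=1)
    chosen = q_values[np.arange(len(actions)), actions]
    return chosen - baseline


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, negated so that lower is better."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

In a training loop under these assumptions, `infonce_loss` would update the two encoders on freshly collected on-policy trajectories, with goals taken from states reached later in the same trajectory, and `ppo_clip_loss` would update the policy using `discrete_advantage` in place of a reward-based advantage estimate.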