Understanding Goal Generalisation in Sequential Reinforcement Learning

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning agents often exhibit unintended goal-directed behavior in out-of-distribution (OOD) environments, and there is no principled account of how their training history shapes that generalization. This work addresses the gap for agents trained sequentially on one or more tasks, studying over 100 sequential training pipelines evaluated across over 250 OOD environments. It finds that salient features drive generalization and that goals learned early in training can persist and influence those acquired later. The authors introduce latent policy gradients, a method that simulates how low-dimensional latent variables evolve during training and uses them to predict the OOD behavior a pipeline will likely induce. The method achieves strong predictive accuracy, generalizes to unseen types of training pipeline, and is interpretable, indicating that the dependence of OOD behavior on the whole training pipeline has a structure that can be captured.

📝 Abstract

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.
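The idea of simulating latent variables by following the reward gradient under a simple behaviour model can be illustrated with a toy sketch. This is not the authors' implementation: the two latents (one per candidate goal feature), the softmax choice model, the object features, the learning rate, and all function names are assumptions made purely for illustration. The sketch runs gradient ascent on expected reward through a sequence of training stages, then reads off the predicted choice in an environment where the two features are decoupled.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def choice_probs(z, env):
    """Assumed behaviour model: P(pick object) = softmax of latent-weighted features."""
    scores = [sum(zk * fk for zk, fk in zip(z, obj["features"])) for obj in env]
    return softmax(scores)

def reward_gradient(z, env):
    """Exact gradient of expected reward with respect to z under the softmax model."""
    probs = choice_probs(z, env)
    expected = sum(p * obj["reward"] for p, obj in zip(probs, env))
    grad = [0.0] * len(z)
    for p, obj in zip(probs, env):
        for k, fk in enumerate(obj["features"]):
            grad[k] += p * (obj["reward"] - expected) * fk
    return grad

def simulate_pipeline(stages, z0, lr=0.5, steps=200):
    """Carry the latents through each training stage in order (hypothetical settings)."""
    z = list(z0)
    for envs in stages:
        for _ in range(steps):
            for env in envs:
                g = reward_gradient(z, env)
                z = [zk + lr * gk for zk, gk in zip(z, g)]
    return z

# Hypothetical features: [is_red, is_square].
stage_colour = [[{"features": [1, 0], "reward": 1.0},   # red circle, rewarded
                 {"features": [0, 0], "reward": 0.0}]]  # blue circle
stage_both = [[{"features": [1, 1], "reward": 1.0},     # red square, rewarded
               {"features": [0, 0], "reward": 0.0}]]    # blue circle

# OOD test: the two features are pulled apart.
ood_env = [{"features": [1, 0], "reward": 0.0},  # red circle
           {"features": [0, 1], "reward": 0.0}]  # blue square

z_seq = simulate_pipeline([stage_colour, stage_both], [0.0, 0.0])
z_single = simulate_pipeline([stage_both], [0.0, 0.0])

p_seq = choice_probs(z_seq, ood_env)[0]        # P(red circle) after colour-first pipeline
p_single = choice_probs(z_single, ood_env)[0]  # P(red circle) after ambiguous stage only
print(z_seq, p_seq)
print(z_single, p_single)
```

In this toy setting, the pipeline that first trains on colour alone keeps a larger colour latent after the ambiguous second stage, so the predicted OOD choice favours the red circle, whereas training on the ambiguous stage alone leaves the two latents equal. This mirrors, in miniature, the abstract's claim that goals learnt early can persist; the real method's latent structure and behaviour model may differ substantially.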

Problem

Research questions and friction points this paper is trying to address.

goal generalisation

sequential reinforcement learning

out-of-distribution behaviour

training pipeline

latent policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent policy gradients

goal generalisation

sequential reinforcement learning