🤖 AI Summary
To address limited generalization in the zero-shot policy transfer (ZSPT) setting for contextual MDPs, this paper identifies a trade-off: increased exploration expands the set of training states and can improve generalization, but it can also reduce the accuracy of the learned value function, which in turn hurts generalization. The authors introduce the notion of "reachability" to characterize which states and contexts require generalization and to explain why exploration helps. Building on the hypothesis that exploration is most beneficial when it increases state coverage without sacrificing value accuracy, they propose Explore-Go, a lightweight method that adds a pure-exploration phase at the beginning of each episode. Explore-Go can be combined with existing on-policy and off-policy RL algorithms and remains effective in partially observable settings. Experiments with several popular algorithms across multiple environments show significant gains in generalization performance, offering practitioners a simple, algorithm-agnostic modification to improve the generalization of their agents.
📝 Abstract
In the zero-shot policy transfer (ZSPT) setting for contextual Markov decision processes (MDPs), agents train on a fixed set of contexts and must generalise to new ones. Recent work has argued and demonstrated that increased exploration can improve this generalisation by training on more states in the training contexts. In this paper, we show that training on more states can indeed improve generalisation, but may come at the cost of reducing the accuracy of the learned value function, which in turn can hurt generalisation. We introduce reachability in the ZSPT setting to define which states and contexts require generalisation and to explain why exploration can improve it. We hypothesise and demonstrate that using exploration to increase the agent's coverage while also maintaining value accuracy improves generalisation even further. Inspired by this, we propose Explore-Go, a method that implements an exploration phase at the beginning of each episode, which can be combined with existing on- and off-policy RL algorithms and significantly improves generalisation even in partially observable MDPs. We demonstrate the effectiveness of Explore-Go when combined with several popular algorithms and show an increase in generalisation performance across several environments. With this, we hope to provide practitioners with a simple modification that can improve the generalisation of their agents.
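The core mechanism, an exploration phase prepended to each episode, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `ChainEnv`, the uniform-random exploration policy, and the `explore_go_episode` helper are all illustrative assumptions, showing only the general idea that the agent's own policy (and its training data) begins from whichever state the exploration prefix reaches.

```python
import random

class ChainEnv:
    """Hypothetical toy chain MDP: states 0..n-1, action 1 moves right,
    action 0 moves left (clamped), reward 1.0 on reaching the last state."""
    def __init__(self, n=10):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(self.n - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.n - 1
        return self.state, (1.0 if done else 0.0), done

def explore_go_episode(env, policy, explore_steps=3, max_steps=50, rng=random):
    """Run one episode with an Explore-Go-style prefix: the first
    `explore_steps` actions are pure exploration (here, uniform random),
    after which the agent's own policy takes over. Only post-exploration
    transitions are returned for training, so the agent effectively learns
    from a more diverse set of starting states."""
    state = env.reset()
    for _ in range(explore_steps):
        state, _, done = env.step(rng.choice([0, 1]))
        if done:
            return []  # episode ended during exploration; nothing to train on
    transitions = []
    for _ in range(max_steps - explore_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return transitions
```

A simple usage example: with a greedy always-move-right policy, the collected transitions begin from whatever state the random prefix reached rather than the fixed initial state, which is the coverage effect the paper attributes to the exploration phase.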