🤖 AI Summary
To address limited generalization in the zero-shot policy transfer (ZSPT) setting for contextual MDPs, this paper identifies a trade-off: increased exploration expands the set of training states and can improve generalization, but it can also reduce the accuracy of the learned value function, which in turn hurts generalization. The authors introduce the notion of "reachability" to characterize which states and contexts require generalization and to explain why exploration helps. Building on the hypothesis that exploration is most beneficial when it increases state coverage without sacrificing value accuracy, they propose Explore-Go, a lightweight method that adds a pure-exploration phase at the beginning of each episode. Explore-Go can be combined with existing on-policy and off-policy RL algorithms and remains effective in partially observable settings. Experiments with several popular algorithms across multiple environments show significant gains in generalization performance, offering practitioners a simple, algorithm-agnostic modification to improve the generalization of their agents.
📝 Abstract
In the zero-shot policy transfer (ZSPT) setting for contextual Markov decision processes (MDPs), agents train on a fixed set of contexts and must generalise to new ones. Recent work has argued and demonstrated that increased exploration can improve this generalisation by training on more states in the training contexts. In this paper, we show that training on more states can indeed improve generalisation, but may come at the cost of reducing the accuracy of the learned value function, which in turn can hurt generalisation. We introduce reachability in the ZSPT setting to define which states and contexts require generalisation and to explain why exploration can improve it. We hypothesise and demonstrate that using exploration to increase the agent's coverage while also maintaining value accuracy improves generalisation even further. Inspired by this, we propose Explore-Go, a method that implements an exploration phase at the beginning of each episode, which can be combined with existing on- and off-policy RL algorithms and significantly improves generalisation even in partially observable MDPs. We demonstrate the effectiveness of Explore-Go when combined with several popular algorithms and show an increase in generalisation performance across several environments. With this, we hope to provide practitioners with a simple modification that can improve the generalisation of their agents.
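The core mechanism, an exploration phase prepended to each episode, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `ChainEnv`, the uniform-random exploration policy, and the `explore_go_episode` helper are all illustrative assumptions, showing only the general idea that the agent's own policy (and its training data) begins from whichever state the exploration prefix reaches.

```python
import random

class ChainEnv:
    """Hypothetical toy chain MDP: states 0..n-1, action 1 moves right,
    action 0 moves left (clamped), reward 1.0 on reaching the last state."""
    def __init__(self, n=10):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(self.n - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.n - 1
        return self.state, (1.0 if done else 0.0), done

def explore_go_episode(env, policy, explore_steps=3, max_steps=50, rng=random):
    """Run one episode with an Explore-Go-style prefix: the first
    `explore_steps` actions are pure exploration (here, uniform random),
    after which the agent's own policy takes over. Only post-exploration
    transitions are returned for training, so the agent effectively learns
    from a more diverse set of starting states."""
    state = env.reset()
    for _ in range(explore_steps):
        state, _, done = env.step(rng.choice([0, 1]))
        if done:
            return []  # episode ended during exploration; nothing to train on
    transitions = []
    for _ in range(max_steps - explore_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return transitions
```

A simple usage example: with a greedy always-move-right policy, the collected transitions begin from whatever state the random prefix reached rather than the fixed initial state, which is the coverage effect the paper attributes to the exploration phase.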