Unsupervised Meta-Testing With Conditional Neural Processes for Hybrid Meta-Reinforcement Learning

📅 2024-10-01
🏛️ IEEE Robotics and Automation Letters
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address low sample efficiency in meta-testing due to the absence of reward signals, this paper proposes UMCNP, an unsupervised meta-reinforcement learning framework. UMCNP integrates policy gradient optimization with task inference, enabling implicit environment dynamics modeling and adaptive policy optimization from a single trajectory of an unseen task. Its key contributions are threefold: (1) decoupling policy learning from task inference to support offline reuse of meta-training data; (2) employing Conditional Neural Processes (CNPs) for unsupervised task representation learning; and (3) combining parameterized policy gradients with a model-predictive control–inspired self-generated rollout mechanism. Evaluated on benchmarks—including 2D point navigation, biased-sensor CartPole, and dynamics-randomized Walker—UMCNP reduces meta-test sample requirements by over 50% while significantly improving few-shot adaptation performance.
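The core mechanism described above, inferring a task latent from a single rollout with a permutation-invariant CNP encoder, then stepping a learned dynamics model to generate self-adaptation rollouts, can be sketched as follows. This is a minimal illustration with untrained random weights and hypothetical dimensions, not the paper's implementation; the names `encode` and `predict_next_state` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACT_DIM, LATENT_DIM = 4, 1, 8  # hypothetical sizes

# Hypothetical encoder/decoder weights; the actual UMCNP model trains
# these offline on transitions reused from PPG meta-training.
W_enc = rng.normal(size=(STATE_DIM + ACT_DIM + STATE_DIM, LATENT_DIM))
W_dec = rng.normal(size=(LATENT_DIM + STATE_DIM + ACT_DIM, STATE_DIM))

def encode(context):
    """Map each (s, a, s') context tuple to a latent and mean-pool.

    Mean pooling is permutation-invariant, the key CNP property that
    lets a single rollout of any length condition the dynamics model.
    """
    per_point = np.tanh(context @ W_enc)   # (N, LATENT_DIM)
    return per_point.mean(axis=0)          # (LATENT_DIM,)

def predict_next_state(z, state, action):
    """Decode a next-state prediction conditioned on the task latent z."""
    inp = np.concatenate([z, state, action])
    return np.tanh(inp @ W_dec)

# One rollout from an unseen test task: N transitions of (s, a, s').
rollout = rng.normal(size=(20, STATE_DIM + ACT_DIM + STATE_DIM))
z = encode(rollout)

# Self-generated rollout: step the learned model instead of the real
# environment, so adaptation needs no further online samples.
s = rng.normal(size=STATE_DIM)
for _ in range(5):
    a = rng.normal(size=ACT_DIM)           # stand-in for a policy action
    s = predict_next_state(z, s, a)
```

Because the latent `z` is fixed after one encoding pass, every imagined transition is conditioned on the same inferred task, which is what allows policy self-adaptation without reward signals at meta-test time.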

📝 Abstract
We introduce Unsupervised Meta-Testing with Conditional Neural Processes (UMCNP), a novel hybrid few-shot meta-reinforcement learning (meta-RL) method that uniquely combines, yet distinctly separates, parameterized policy gradient-based (PPG) and task inference-based few-shot meta-RL. Tailored for settings where the reward signal is missing during meta-testing, our method increases sample efficiency without requiring additional samples in meta-training. UMCNP leverages the efficiency and scalability of Conditional Neural Processes (CNPs) to reduce the number of online interactions required in meta-testing. During meta-training, samples previously collected through PPG meta-RL are efficiently reused for learning task inference in an offline manner. UMCNP infers the latent representation of the transition dynamics model from a single test task rollout with unknown parameters. This approach allows us to generate rollouts for self-adaptation by interacting with the learned dynamics model. We demonstrate our method can adapt to an unseen test task using significantly fewer samples during meta-testing than the baselines in 2D-Point Agent and continuous control meta-RL benchmarks, namely cartpole with unknown angle sensor bias and walker agent with randomized dynamics parameters.
Problem

Research questions and friction points this paper is trying to address.

Develop hybrid meta-RL method for missing reward signals
Improve sample efficiency without extra meta-training samples
Adapt to unseen tasks with fewer test samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines policy gradient and task inference meta-RL
Uses Conditional Neural Processes for efficiency
Infers dynamics from single rollout for adaptation
S. E. Ada
Department of Computer Engineering, Bogazici University, Istanbul, Türkiye
Emre Ugur
Bogazici University
Artificial Intelligence · Robotics · Cognitive Robotics · Robot Learning · Neuro-Symbolic Robotics