Upside Down Reinforcement Learning with Policy Generators

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses low sample efficiency and poor generalization in command-conditioned reinforcement learning. We propose a critic-free, end-to-end framework. Our key contributions are: (1) a Hypernetwork-based, command-conditioned policy generator that directly synthesizes neural network policies satisfying specified return targets; (2) a decoupled sampling mechanism that separates buffer sampling probability from the absolute number of stored policies, with weighted sampling to improve training stability; and (3) zero-shot generalization to unseen return targets within the Upside-Down Reinforcement Learning (UDRL) framework. Experiments on multiple benchmark tasks show substantial improvements in sample efficiency and high-return policy performance, along with strong cross-target generalization: the generator synthesizes effective policies for novel return specifications without additional fine-tuning.
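The command-conditioned policy generator in contribution (1) can be sketched as a small hypernetwork that decodes a scalar return command into the flat parameter vector of a linear policy. This is a minimal illustrative sketch, not the authors' implementation; all dimensions, layer sizes, and names here are assumptions.

```python
import numpy as np

# Illustrative sketch of a hypernetwork-based policy generator:
# a scalar return command is decoded into the weights of a small
# linear policy. Sizes and names are assumptions, not the paper's.
OBS_DIM, ACT_DIM, HIDDEN = 4, 2, 32
N_POLICY_PARAMS = OBS_DIM * ACT_DIM + ACT_DIM  # weight matrix + bias

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (HIDDEN, 1))                 # command -> hidden
W2 = rng.normal(0, 0.1, (N_POLICY_PARAMS, HIDDEN))   # hidden -> policy params

def generate_policy(command: float):
    """Decode a desired-return command into a command-specific policy."""
    h = np.tanh(W1 @ np.array([[command]]))          # (HIDDEN, 1)
    theta = (W2 @ h).ravel()                         # flat policy parameters
    W_pi = theta[: OBS_DIM * ACT_DIM].reshape(ACT_DIM, OBS_DIM)
    b_pi = theta[OBS_DIM * ACT_DIM :]
    return lambda obs: W_pi @ obs + b_pi             # deterministic linear policy

policy = generate_policy(200.0)   # request a policy for a return of 200
action = policy(np.zeros(OBS_DIM))
```

In the full method the generator is trained end-to-end on (return, policy) pairs, so the decoded weights come to satisfy the requested return target; this sketch only shows the decoding path.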

📝 Abstract
Upside Down Reinforcement Learning (UDRL) is a promising framework for solving reinforcement learning problems which focuses on learning command-conditioned policies. In this work, we extend UDRL to the task of learning a command-conditioned generator of deep neural network policies. We accomplish this using Hypernetworks - a variant of Fast Weight Programmers, which learn to decode input commands representing a desired expected return into command-specific weight matrices. Our method, dubbed Upside Down Reinforcement Learning with Policy Generators (UDRLPG), streamlines comparable techniques by removing the need for an evaluator or critic to update the weights of the generator. To counteract the increased variance in last returns caused by not having an evaluator, we decouple the sampling probability of the buffer from the absolute number of policies in it, which, together with a simple weighting strategy, improves the empirical convergence of the algorithm. Compared with existing algorithms, UDRLPG achieves competitive performance and high returns, sometimes outperforming more complex architectures. Our experiments show that a trained generator can generalize to create policies that achieve unseen returns zero-shot. The proposed method appears to be effective in mitigating some of the challenges associated with learning highly multimodal functions. Altogether, we believe that UDRLPG represents a promising step forward in achieving greater empirical sample efficiency in RL. A full implementation of UDRLPG is publicly available at https://github.com/JacopoD/udrlpg_
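The abstract's decoupling of buffer sampling probability from the absolute number of stored policies, combined with a simple weighting strategy, might look like the following sketch. The class name, capacity, and return-based weighting are illustrative assumptions, not the paper's exact scheme.

```python
import random

# Illustrative sketch (assumed names): a buffer of (return, policy_params)
# pairs whose sampling distribution depends only on the relative returns
# of stored entries, not on how many policies the buffer holds.
class WeightedPolicyBuffer:
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.items = []  # list of (return, policy_params) tuples

    def add(self, ret, params):
        self.items.append((ret, params))
        if len(self.items) > self.capacity:
            # Keep the buffer bounded by dropping the lowest-return entry.
            self.items.sort(key=lambda x: x[0])
            self.items.pop(0)

    def sample(self, k, rng=random):
        # Shift returns to be positive, then use them as sampling weights.
        # Normalization inside choices() decouples the probabilities from
        # the absolute number of stored policies.
        lo = min(r for r, _ in self.items)
        weights = [r - lo + 1e-6 for r, _ in self.items]
        return rng.choices(self.items, weights=weights, k=k)
```

Biasing sampling toward higher-return policies is one plausible way to realize the "simple weighting strategy" the abstract credits with improving empirical convergence.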
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Complex Behavior
Decision-making Skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

UDRLPG
Hypernetworks
Policy Generation
Jacopo Di Ventura
Università della Svizzera italiana, Lugano, Switzerland; The Swiss AI Lab IDSIA (USI-SUPSI), Lugano, Switzerland; Scuola universitaria professionale della Svizzera italiana, Lugano, Switzerland; Center of Excellence for Generative AI, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Now at Leiden University
Dylan R. Ashley
Ph.D. Student, Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI-SUPSI)
Reinforcement Learning, Deep Learning, Machine Learning, Artificial Intelligence
Francesco Faccio
Senior Research Scientist, Google DeepMind
Reinforcement Learning, Deep Learning, Neural Networks
Vincent Herrmann
Università della Svizzera italiana, Lugano, Switzerland; The Swiss AI Lab IDSIA (USI-SUPSI), Lugano, Switzerland; Scuola universitaria professionale della Svizzera italiana, Lugano, Switzerland
Jürgen Schmidhuber
Università della Svizzera italiana, Lugano, Switzerland; The Swiss AI Lab IDSIA (USI-SUPSI), Lugano, Switzerland; Scuola universitaria professionale della Svizzera italiana, Lugano, Switzerland; Center of Excellence for Generative AI, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; NNAISENSE, Lugano, Switzerland