Distributional Successor Features Enable Zero-Shot Policy Optimization

📅 2024-03-10

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This paper addresses zero-shot transfer across reward functions in offline reinforcement learning, where rollout-based prediction accumulates compounding error and a learned policy cannot be re-optimized instantly for a new reward. It proposes Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), which have two components: (1) a model, instantiated with diffusion models, of the distribution of successor features achievable under the dataset's behavior policy, which predicts long-term outcomes directly rather than through error-prone multi-step rollouts; and (2) a policy that acts to realize a chosen successor feature. Because successor features reduce policy evaluation under a new reward to linear regression, a new task is handled by fitting the reward weights and selecting the best outcome achievable within the dataset, with no fine-tuning. According to the summary, DiSPOs outperform existing offline RL and successor-feature methods across diverse simulated robotic tasks, and the paper provides a theoretical lower bound on policy performance.

📝 Abstract

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code available at https://weirdlabuw.github.io/dispo/.
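
The claim that successor features reduce policy evaluation under a new reward to linear regression can be illustrated with a small sketch. It assumes the reward is approximately linear in state features, r ≈ φ(s) · w, so that the value of a long-term outcome with successor feature ψ is ψ · w. All names below (`fit_reward_weights`, `best_outcome`, the toy features and candidate outcomes) are hypothetical and chosen for illustration; they are not taken from the paper's code. In DiSPOs the candidate successor features would be sampled from a learned diffusion model rather than hard-coded.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_reward_weights(features, rewards):
    """Least-squares fit of w in r ~ phi . w via the normal equations."""
    d = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) for j in range(d)]
            for i in range(d)]
    target = [sum(f[i] * r for f, r in zip(features, rewards))
              for i in range(d)]
    return solve_linear_system(gram, target)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_outcome(candidate_psis, w):
    """Return the index of the candidate with the highest value psi . w."""
    values = [dot(psi, w) for psi in candidate_psis]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values


# Toy data (assumed): state features phi(s) and rewards labeled for a new task.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
rewards = [1.0, -2.0, -1.0, 0.0]

# Stand-in for successor features sampled from a learned outcome model.
candidate_psis = [[3.0, 2.0], [1.0, 0.0], [0.5, 2.0]]

w = fit_reward_weights(features, rewards)
best_idx, values = best_outcome(candidate_psis, w)
print("fitted w:", w)
print("candidate values:", values, "-> chosen outcome:", best_idx)

# A different reward only changes w; the candidate outcomes are reused as-is.
other_idx, other_values = best_outcome(candidate_psis, [1.0, 1.0])
print("under another reward, chosen outcome:", other_idx)
```

The point of the sketch is that switching tasks only requires refitting `w`; the outcome model and the candidates it produces are untouched. In practice a library routine such as `numpy.linalg.lstsq` would replace the hand-written solver, and a policy conditioned on the selected successor feature would produce the action.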

Problem

Research questions and friction points this paper is trying to address.

Model-based Reinforcement Learning

Cumulative Error Mitigation

Rapid Task Adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

DiSPOs Method

Successor Feature Learning

Diffusion Models for Reinforcement Learning

🔎 Similar Papers

Zero-Shot Generalization of Vision-Based RL Without Data Augmentation