Distributional Successor Features Enable Zero-Shot Policy Optimization

📅 2024-03-10
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenge of cross-reward zero-shot transfer in offline reinforcement learning, where rollout-based prediction accumulates error and policies struggle to adapt instantly to new reward functions, this paper proposes the Distributional Successor Features for Zero-Shot Policy Optimization (DiSPO) framework. DiSPO decouples policy optimization into two components: (1) diffusion-model-based modeling of the successor feature distribution, eliminating error-prone multi-step rollouts; and (2) action generation linearly reweighted by the target reward. By jointly learning state-action representations and explicitly modeling the successor feature distribution under the behavior policy, DiSPO achieves strict zero-shot transfer without fine-tuning. It substantially outperforms existing offline RL and successor-feature methods across diverse simulated robotic tasks and provides a theoretical lower bound on policy performance.

📝 Abstract
Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code available at https://weirdlabuw.github.io/dispo/.
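The abstract's core mechanism is that successor features reduce policy evaluation under a new reward to linear regression: if rewards are (approximately) linear in state features, r(s) ≈ φ(s)·w, then the discounted sum of future features ψ satisfies Q ≈ ψ·w, so scoring candidate long-term outcomes under a new task is a single dot product. The following is a minimal numpy sketch of that idea on synthetic data; the feature dimension, the "sampled successor features" array, and all variable names are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 2-D state features phi(s), with rewards linear in phi.
d = 2
true_w = np.array([1.0, -0.5])
phi = rng.normal(size=(500, d))     # features of dataset states
rewards = phi @ true_w              # r(s) = phi(s) . w

# Step 1 (zero-shot reward inference): recover the task weights w by
# least-squares regression of observed rewards onto features.
w, *_ = np.linalg.lstsq(phi, rewards, rcond=None)

# Step 2 (outcome scoring): each candidate long-term outcome is a successor
# feature psi (a discounted sum of future phi's). Its value under the new
# reward is psi . w, so picking the best achievable outcome is linear scoring.
candidate_psis = rng.normal(size=(10, d))  # stand-ins for sampled successor features
values = candidate_psis @ w
best_outcome = candidate_psis[np.argmax(values)]
```

In DiSPO the candidate successor features would come from the learned diffusion model over outcomes achievable in the dataset, and a policy is trained to realize the selected outcome; here they are random placeholders to keep the sketch self-contained.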
Problem

Research questions and friction points this paper is trying to address.

Model-based Reinforcement Learning
Cumulative Error Mitigation
Rapid Task Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiSPOs method
successor feature learning
diffusion models for reinforcement learning
👥 Authors
Chuning Zhu
University of Washington
Reinforcement learning, Robotics
Xinqi Wang
University of Washington
Tyler Han
Graduate Student, University of Washington
Robotics, imitation learning, controls
S. S. Du
University of Washington
Abhishek Gupta
University of Washington