🤖 AI Summary
This paper addresses zero-shot transfer across reward functions in offline reinforcement learning, where rollout-based prediction accumulates error and policies struggle to adapt immediately to new reward functions. It proposes Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs). DiSPOs decouple policy optimization into two components: (1) a diffusion model of the successor feature distribution, which avoids error-prone multi-step rollouts; and (2) action generation linearly reweighted by the target reward. By jointly learning state-action representations and explicitly modeling the successor feature distribution under the behavior policy, DiSPOs achieve zero-shot transfer without fine-tuning. The method substantially outperforms existing offline RL and successor-feature methods across diverse simulated robotic tasks, and the paper provides a theoretical lower bound on policy performance.
📝 Abstract
Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.
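The abstract's remark that successor features reduce policy evaluation under new rewards to linear regression can be illustrated with a minimal tabular sketch. This is not the DiSPO method itself (no distributional model or diffusion model is involved). The 3-state transition matrix `P`, the one-hot features `Phi`, the discount `gamma`, and the helper `fit_reward_weights` are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by a fixed policy (illustrative only).
# P[s, s2] is the probability of moving from state s to state s2 under that policy.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.9],
])
gamma = 0.9          # discount factor (assumed)
Phi = np.eye(3)      # one-hot state features (assumed); row s is phi(s)

# Successor features: Psi[s] = E[ sum_t gamma^t * phi(s_t) | s_0 = s ].
# In the tabular case this has the closed form (I - gamma * P)^{-1} Phi.
Psi = np.linalg.solve(np.eye(3) - gamma * P, Phi)


def fit_reward_weights(features, rewards):
    """Least-squares fit of w such that features @ w approximates rewards."""
    w, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return w


# A new reward function, given as one reward per state.
new_rewards = np.array([0.0, 0.0, 1.0])

# Step 1: linear regression from features to the new rewards.
w = fit_reward_weights(Phi, new_rewards)

# Step 2: evaluating the policy under the new reward is a single matrix product.
V_from_sf = Psi @ w

# Reference: solve the Bellman equation directly for the same reward.
V_direct = np.linalg.solve(np.eye(3) - gamma * P, new_rewards)

print(np.allclose(V_from_sf, V_direct))
```

Once `Psi` has been computed, switching to a different reward only requires refitting `w`; no further rollouts or Bellman backups are needed, which is the property the abstract relies on.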