🤖 AI Summary
This work addresses the challenge of constructing solvable MDPs from static datasets in offline reinforcement learning. The proposed DAC-MDP framework builds a finite, non-parametric MDP on top of learned deep representations and augments it with a data-scarcity-aware cost mechanism that penalizes exploiting under-represented parts of the dataset. Because the derived MDP can be re-solved cheaply, the approach supports multiple solution objectives and zero-shot adjustment to changing goals, and its solutions carry theoretical performance lower bounds under stated conditions. Evaluated across diverse benchmarks, including tasks with image-based observations, DAC-MDP scales to large, complex offline RL problems. Its core idea is to embed the geometric structure of the data distribution directly into both MDP modeling and policy optimization.
📝 Abstract
We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. DAC-MDPs are non-parametric models that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate its empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large, complex offline RL problems.
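The construction the abstract describes (derive a finite MDP from logged transitions, penalize under-represented state-action pairs, then solve the result optimally) can be sketched in a tabular setting. This is a minimal illustrative sketch, not the paper's exact formulation: the 1/sqrt(count) penalty form, the self-loop handling for unseen pairs, and the function names are assumptions, and the actual DAC-MDP operates over learned nearest-neighbor representations rather than raw discrete states.

```python
import numpy as np

def build_dac_mdp(transitions, n_states, n_actions, cost_coef=1.0):
    """Build a finite MDP from logged (s, a, r, s') tuples, penalizing
    under-visited state-action pairs. Simplified tabular sketch: the
    1/sqrt(count) cost form is an assumption, not the paper's formula."""
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    P = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s2 in transitions:
        counts[s, a] += 1
        reward_sum[s, a] += r
        P[s, a, s2] += 1
    seen = counts > 0
    R = np.zeros((n_states, n_actions))
    R[seen] = reward_sum[seen] / counts[seen]
    P[seen] /= counts[seen][:, None]
    # Unseen pairs self-loop so every row of P is a valid distribution.
    unseen_s, unseen_a = np.where(~seen)
    P[unseen_s, unseen_a, unseen_s] = 1.0
    # Data-scarcity cost: large for rarely (or never) visited pairs.
    R = R - cost_coef / np.sqrt(counts + 1.0)
    return P, R

def value_iteration(P, R, gamma=0.9, iters=1000):
    """Solve the derived finite MDP exactly by value iteration."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)  # (S,A,S) @ (S,) -> (S,A)
    return Q

# Usage: a two-state dataset where one action is well supported and
# another is tried only once; the cost steers the policy toward data.
data = [(0, 0, 0.0, 1)] * 10 + [(1, 0, 1.0, 1)] * 10 + [(0, 1, 0.0, 0)]
P, R = build_dac_mdp(data, n_states=2, n_actions=2)
Q = value_iteration(P, R)
```

Re-solving with a different reward vector (e.g. for a new goal) reuses the same estimated `P`, which is what makes the zero-shot adjustment described in the abstract cheap.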