🤖 AI Summary
This work addresses the challenge of constructing solvable MDPs from static datasets in offline reinforcement learning. The proposed DAC-MDP framework builds a finite, non-parametric MDP on top of learned deep representations and augments it with a data-scarcity-aware cost mechanism that penalizes exploiting under-represented parts of the dataset. Because the derived MDP can be re-solved cheaply, the approach supports multiple solution objectives and zero-shot adjustment to changing goals, and its solutions carry theoretical performance lower bounds under stated conditions. Evaluated across diverse benchmarks, including tasks with image-based observations, DAC-MDP scales to large, complex offline RL problems. Its core idea is to embed the geometric structure of the data distribution directly into both MDP modeling and policy optimization.
📝 Abstract
We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. DAC-MDPs are non-parametric models that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate its empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large, complex offline RL problems.
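The construction the abstract describes (derive a finite MDP from logged transitions, penalize under-represented state-action pairs, then solve the result optimally) can be sketched in a tabular setting. This is a minimal illustrative sketch, not the paper's exact formulation: the 1/sqrt(count) penalty form, the self-loop handling for unseen pairs, and the function names are assumptions, and the actual DAC-MDP operates over learned nearest-neighbor representations rather than raw discrete states.

```python
import numpy as np

def build_dac_mdp(transitions, n_states, n_actions, cost_coef=1.0):
    """Build a finite MDP from logged (s, a, r, s') tuples, penalizing
    under-visited state-action pairs. Simplified tabular sketch: the
    1/sqrt(count) cost form is an assumption, not the paper's formula."""
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    P = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s2 in transitions:
        counts[s, a] += 1
        reward_sum[s, a] += r
        P[s, a, s2] += 1
    seen = counts > 0
    R = np.zeros((n_states, n_actions))
    R[seen] = reward_sum[seen] / counts[seen]
    P[seen] /= counts[seen][:, None]
    # Unseen pairs self-loop so every row of P is a valid distribution.
    unseen_s, unseen_a = np.where(~seen)
    P[unseen_s, unseen_a, unseen_s] = 1.0
    # Data-scarcity cost: large for rarely (or never) visited pairs.
    R = R - cost_coef / np.sqrt(counts + 1.0)
    return P, R

def value_iteration(P, R, gamma=0.9, iters=1000):
    """Solve the derived finite MDP exactly by value iteration."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)  # (S,A,S) @ (S,) -> (S,A)
    return Q

# Usage: a two-state dataset where one action is well supported and
# another is tried only once; the cost steers the policy toward data.
data = [(0, 0, 0.0, 1)] * 10 + [(1, 0, 1.0, 1)] * 10 + [(0, 1, 0.0, 0)]
P, R = build_dac_mdp(data, n_states=2, n_actions=2)
Q = value_iteration(P, R)
```

Re-solving with a different reward vector (e.g. for a new goal) reuses the same estimated `P`, which is what makes the zero-shot adjustment described in the abstract cheap.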