Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

📅 2024-11-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies offline distributionally robust reinforcement learning in non-Markovian decision processes (NMDPs), aiming to learn policies with performance guarantees against the worst-case environment within an uncertainty set, using only a fixed offline dataset. Methodologically, it introduces the first dual characterization of non-Markovian robust value functions; proposes an optimization framework combining dataset distillation with lower-confidence-bound (LCB) regularization; and defines two novel concentrability coefficients to overcome key analytical bottlenecks in non-Markovian settings. Theoretically, it establishes the first $O(1/\varepsilon^2)$ sample-complexity guarantee for robust NMDPs, proving that an $\varepsilon$-optimal robust policy is learnable from polynomially many samples, in both the low-rank and the general (non-low-rank) NMDP setting.
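The summary refers to a dual characterization of non-Markovian robust values without stating it. For reference, the standard KL-ball duality used throughout robust RL takes the form below; assuming a KL uncertainty set, the paper's non-Markovian dual would presumably condition on full histories rather than states, so this is a generic sketch rather than the paper's theorem:

\[
\inf_{P:\, D_{\mathrm{KL}}(P \,\|\, P_0) \le \rho} \mathbb{E}_{P}[V]
\;=\; \sup_{\lambda \ge 0} \Big\{ -\lambda \log \mathbb{E}_{P_0}\!\big[e^{-V/\lambda}\big] \;-\; \lambda \rho \Big\},
\]

where $P_0$ is the nominal transition model, $\rho$ the radius of the uncertainty set, and $V$ the value of the successor history.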

📝 Abstract
Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs best under the worst environment within an uncertainty set, using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL has been limited to planning problems where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm featuring a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of uncertainty sets. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored to offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy using $O(1/\epsilon^2)$ offline samples. Moreover, we extend our algorithm to the case where the nominal model has no specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all the different types of uncertainty sets considered.
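The abstract only names the LCB design and the dual forms at a high level. Below is a minimal sketch of how a pessimistic (LCB) robust value estimate could be assembled, assuming a KL uncertainty set, a plug-in of the standard KL duality, and a bonus shrinking like $1/\sqrt{n}$; the function names, the bonus form, and the grid search over the dual variable are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def kl_robust_value(values, probs, rho, lambdas=np.logspace(-3, 3, 200)):
    """Worst-case expected value over a KL ball of radius `rho` around the
    nominal distribution `probs`, via the standard dual
    sup_{lambda > 0} { -lambda * log E_probs[exp(-values/lambda)] - lambda * rho }.
    (Generic KL duality; the paper's non-Markovian dual may differ.)"""
    best = -np.inf
    for lam in lambdas:
        shifted = -values / lam
        m = shifted.max()  # log-sum-exp stabilization
        log_mgf = m + np.log(np.dot(probs, np.exp(shifted - m)))
        best = max(best, -lam * log_mgf - lam * rho)
    return best

def lcb_robust_estimate(robust_value, n_visits, beta=1.0):
    """Pessimistic (lower-confidence-bound) adjustment: subtract a bonus that
    shrinks as the offline dataset covers this history more often.
    The beta / sqrt(n) form is an illustrative assumption."""
    return robust_value - beta / np.sqrt(max(n_visits, 1))

# Hypothetical usage with quantities estimated from an offline dataset.
V_next = np.array([1.0, 0.5, 0.0])   # values of successor histories
P_hat = np.array([0.6, 0.3, 0.1])    # estimated nominal transition probabilities
robust_v = kl_robust_value(V_next, P_hat, rho=0.1)
print(lcb_robust_estimate(robust_v, n_visits=25))
```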
Problem

Research questions and friction points this paper is trying to address.

Offline Reinforcement Learning
Non-Markov Decision Processes
Robust Strategy Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-Markov Decision Processes
Distributionally Robust Offline Learning
Adversarial Value Estimation
Ruiquan Huang
Penn State University
Yingbin Liang
Department of Electrical and Computer Engineering, The Ohio State University
Jing Yang
Department of Electrical Engineering, The Pennsylvania State University