🤖 AI Summary
This paper studies offline distributionally robust reinforcement learning under non-Markov decision processes (NMDPs), aiming to learn policies with performance guarantees against the worst-case environment within an uncertainty set, using only a fixed offline dataset. Methodologically, it introduces the first dual characterization of non-Markov robust value functions; proposes a synergistic optimization framework combining data distillation with lower-confidence-bound (LCB) regularization; and defines two novel concentrability coefficients to overcome key analytical bottlenecks in non-Markov settings. Theoretically, it establishes the first $O(1/\varepsilon^2)$ sample complexity for NMDPs, rigorously proving that an $\varepsilon$-optimal robust policy is learnable from polynomially many samples—applicable to both low-rank and general (non-low-rank) NMDP settings.
📝 Abstract
Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs best under the worst environment within an uncertainty set, using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL has been limited to the planning problem, where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm featuring a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of uncertainty sets. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy using $O(1/\epsilon^2)$ offline samples. Moreover, we extend our algorithm to the case when the nominal model does not admit a specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all types of uncertainty sets.
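To give a concrete feel for the dual forms mentioned above, the sketch below evaluates a worst-case expected value over a KL-divergence uncertainty ball using the classical dual formula $\inf_{Q:\,\mathrm{KL}(Q\|P)\le\rho}\mathbb{E}_Q[V] = \sup_{\lambda>0}\{-\lambda\log\mathbb{E}_P[e^{-V/\lambda}]-\lambda\rho\}$, which turns an infinite-dimensional minimization over distributions into a one-dimensional search. This is the standard MDP-case dual, shown only as an assumption-laden illustration; the paper's non-Markovian duals are more involved, and the helper name and grid search here are hypothetical choices, not the paper's algorithm.

```python
import math

def kl_dual_robust_value(values, probs, rho, lam_grid=None):
    """Worst-case expected value over a KL ball of radius rho around the
    nominal distribution `probs`, via the classical dual:
        sup_{lam > 0}  -lam * log E_P[exp(-V/lam)] - lam * rho
    A hypothetical illustration of the standard (Markovian) KL dual; the
    paper derives analogous duals for non-Markovian robust values.
    """
    if lam_grid is None:
        # Crude log-spaced search over the dual variable, 1e-4 .. 1e4.
        lam_grid = [10 ** (k / 10) for k in range(-40, 41)]
    best = -math.inf
    for lam in lam_grid:
        # Log-sum-exp trick keeps exp(-v/lam) from overflowing for tiny lam.
        m = max(-v / lam for v in values)
        log_mgf = m + math.log(sum(p * math.exp(-v / lam - m)
                                   for v, p in zip(values, probs)))
        best = max(best, -lam * log_mgf - lam * rho)
    return best

# Toy nominal model: three outcomes with values 1, 2, 3.
values = [1.0, 2.0, 3.0]
probs = [0.2, 0.5, 0.3]
robust = kl_dual_robust_value(values, probs, rho=0.1)
nominal = sum(v * p for v, p in zip(values, probs))
```

Since the nominal distribution lies inside the ball for any $\rho > 0$, the robust value is bounded above by the nominal expectation and below by the minimum outcome; the LCB regularization in the paper then further penalizes such estimates to account for finite-sample error in the offline dataset.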