Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces

📅 2024-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper investigates the non-asymptotic statistical efficiency of policy evaluation in distributional reinforcement learning. For temporal difference (TD) learning in tabular Markov decision processes, the authors first establish a Freedman-type inequality in Hilbert spaces, which serves as the key concentration tool for non-i.i.d. stochastic processes. Using it, they derive the sample complexity $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$, minimax optimal up to logarithmic factors, for both nonparametric and categorical distributional TD under the 1-Wasserstein metric. They also extend the analysis to the Markovian data setting and propose variance-reduced variants of distributional TD whose sample complexity matches state-of-the-art results for classical policy evaluation.

📝 Abstract
Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning (distributional TD) has accordingly been proposed as an extension of classic temporal difference (TD) learning in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$ interactions with the environment to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $1$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors. In addition, we revisit categorical distributional TD (CTD), showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $1$-Wasserstein distance. We also extend our analysis to the more general setting where the data generating process is Markovian. In the Markovian setting, we propose variance-reduced variants of NTD and CTD, and show that both achieve a $\tilde{O}(\varepsilon^{-2}\mu_{\pi,\min}^{-1}(1-\gamma)^{-3}+t_{mix}\mu_{\pi,\min}^{-1}(1-\gamma)^{-1})$ sample complexity bound in the case of the $1$-Wasserstein distance, which matches the state-of-the-art statistical results for classic policy evaluation. To achieve the sharp statistical rates, we establish a novel Freedman's inequality in Hilbert spaces. This new Freedman's inequality may be of independent interest for the statistical analysis of various infinite-dimensional online learning problems.
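
To make the categorical variant concrete, the following is a minimal sketch of a CTD-style update with a generative model, not the paper's exact algorithm. It assumes rewards in $[0,1]$, evenly spaced atoms on $[0, 1/(1-\gamma)]$, the standard linear-interpolation projection, and a step-size schedule chosen only for illustration; the names `project_onto_atoms` and `run_ctd` are hypothetical.

```python
import numpy as np


def project_onto_atoms(locations, weights, atoms):
    """Split each point mass between its two neighbouring atoms.

    Total mass is preserved; the mean is preserved as long as every
    location lies inside [atoms[0], atoms[-1]].
    """
    k = len(atoms)
    gap = atoms[1] - atoms[0]
    # Fractional atom index of each location, clipped to the support.
    pos = np.clip((locations - atoms[0]) / gap, 0.0, k - 1.0)
    lower = np.floor(pos).astype(int)
    upper = np.minimum(lower + 1, k - 1)
    frac = pos - lower
    out = np.zeros(k)
    np.add.at(out, lower, weights * (1.0 - frac))
    np.add.at(out, upper, weights * frac)
    return out


def run_ctd(P, rewards, gamma, num_atoms=51, num_iters=5000, seed=0):
    """Synchronous categorical TD for a fixed policy (illustrative sketch).

    P[s] is the next-state distribution under the policy and rewards[s]
    is a deterministic reward in [0, 1]; both are assumptions of this demo.
    """
    rng = np.random.default_rng(seed)
    n = len(rewards)
    atoms = np.linspace(0.0, 1.0 / (1.0 - gamma), num_atoms)
    probs = np.full((n, num_atoms), 1.0 / num_atoms)  # uniform initialisation
    for t in range(num_iters):
        alpha = 1.0 / (1.0 + (1.0 - gamma) * t)  # assumed step-size schedule
        new_probs = probs.copy()
        for s in range(n):
            # Generative model: draw one next state for every state.
            s_next = rng.choice(n, p=P[s])
            # Push the next-state distribution through z -> r + gamma * z,
            # then project it back onto the fixed atoms.
            target = project_onto_atoms(
                rewards[s] + gamma * atoms, probs[s_next], atoms
            )
            new_probs[s] = (1.0 - alpha) * probs[s] + alpha * target
        probs = new_probs
    return atoms, probs


# Toy two-state chain (made-up numbers, for illustration only).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
rewards = np.array([0.0, 1.0])
atoms, probs = run_ctd(P, rewards, gamma=0.5)
print(probs @ atoms)  # means of the estimated return distributions
```

Because the projection is mean-preserving on this support, the means of the estimated distributions follow an ordinary TD recursion, so they can be checked against the value function obtained by solving $(I-\gamma P)V = r$.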
Problem

Research questions and friction points this paper is trying to address.

Distributional Temporal Difference Learning
Markov Decision Processes
Hilbert Space Inequalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nonparametric Temporal Difference Learning
Optimal Statistical Efficiency
Hilbert Space Freedman Inequality