AI Summary
This paper investigates the non-asymptotic statistical efficiency of policy evaluation in distributional reinforcement learning. Addressing distributional temporal difference (TD) learning in tabular Markov decision processes, we first establish a Freedman-type inequality in Hilbert spaces, yielding a key concentration tool for non-i.i.d. stochastic processes. Leveraging this, we derive a sample complexity of $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$, which is minimax optimal up to logarithmic factors, for both non-parametric and categorical distributional TD algorithms with a generative model under the 1-Wasserstein metric. Furthermore, we extend the analysis to the Markovian data setting and propose variance-reduced variants of distributional TD whose sample complexity matches the state-of-the-art results for classical policy evaluation.
Abstract
Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning has accordingly been proposed as an extension of classic temporal difference (TD) learning in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$ interactions with the environment to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $1$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors. In addition, we revisit categorical distributional TD (CTD), showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $1$-Wasserstein distance. We also extend our analysis to the more general setting where the data generating process is Markovian. In the Markovian setting, we propose variance-reduced variants of NTD and CTD, and show that both can achieve a $\tilde{O}(\varepsilon^{-2}\mu_{\pi,\min}^{-1}(1-\gamma)^{-3}+t_{mix}\mu_{\pi,\min}^{-1}(1-\gamma)^{-1})$ sample complexity bound in the case of the $1$-Wasserstein distance, which matches the state-of-the-art statistical results for classic policy evaluation. To achieve the sharp statistical rates, we establish a novel Freedman's inequality in Hilbert spaces. This new Freedman's inequality would be of independent interest for statistical analysis of various infinite-dimensional online learning problems.
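To make the categorical distributional TD (CTD) update mentioned above concrete, here is a minimal Python sketch of the standard textbook form of the update. It is not taken from the paper, and every detail is an assumption made for illustration: equally spaced atoms on $[0, 1/(1-\gamma)]$, rewards in $[0,1]$, a constant step size, a toy one-state MDP, and the hypothetical helper names `categorical_projection`, `ctd_update`, and `w1_on_grid`.

```python
import numpy as np


def categorical_projection(locations, probs, atoms):
    """Project sum_j probs[j] * delta(locations[j]) onto equally spaced atoms.

    Each particle's mass is split linearly between its two neighbouring atoms.
    """
    K = len(atoms) - 1
    v_min, v_max = atoms[0], atoms[-1]
    spacing = (v_max - v_min) / K
    # Fractional grid index of every particle after clipping to the support.
    pos = (np.clip(locations, v_min, v_max) - v_min) / spacing
    lower = np.minimum(np.floor(pos).astype(int), K)
    upper = np.minimum(lower + 1, K)
    w_upper = pos - lower
    out = np.zeros(K + 1)
    np.add.at(out, lower, probs * (1.0 - w_upper))
    np.add.at(out, upper, probs * w_upper)
    return out


def ctd_update(p, s, r, s_next, atoms, gamma, alpha):
    """One categorical TD step at state s from a sampled transition (s, r, s_next)."""
    # Push the next-state distribution through x -> r + gamma * x, then project.
    target = categorical_projection(r + gamma * atoms, p[s_next], atoms)
    p = p.copy()
    p[s] = (1.0 - alpha) * p[s] + alpha * target
    return p


def w1_on_grid(p, q, atoms):
    """1-Wasserstein distance between two distributions on the same atom grid."""
    spacing = atoms[1] - atoms[0]
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))
    return float(spacing * cdf_gap[:-1].sum())


# Toy check (assumed setup): one state, reward 1 at every step, gamma = 0.5,
# so the return equals 1 / (1 - gamma) = 2 with probability one.
gamma = 0.5
atoms = np.linspace(0.0, 1.0 / (1.0 - gamma), 5)  # atoms 0, 0.5, 1, 1.5, 2
p = np.full((1, 5), 0.2)  # start from the uniform distribution
for _ in range(200):
    p = ctd_update(p, s=0, r=1.0, s_next=0, atoms=atoms, gamma=gamma, alpha=0.5)
point_mass_at_2 = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
error = w1_on_grid(p[0], point_mass_at_2, atoms)
```

In this toy case the true return distribution sits exactly on an atom, so `error` shrinks toward zero as the iterations proceed. In general the categorical projection introduces an approximation error that depends on the atom spacing, which this sketch does not attempt to quantify.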