Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces

📅 2024-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper investigates the non-asymptotic statistical efficiency of policy evaluation in distributional reinforcement learning. Addressing distributional temporal difference (TD) learning in tabular Markov decision processes, we first establish a Freedman-type inequality in Hilbert spaces, which serves as the key concentration tool for non-i.i.d. stochastic processes. Leveraging this, we derive the sample complexity $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$, minimax optimal up to logarithmic factors, for both nonparametric and categorical distributional TD algorithms under the 1-Wasserstein metric. Furthermore, we extend the analysis to the Markovian data setting and propose variance-reduced variants of distributional TD whose sample complexity matches the state-of-the-art results for classical policy evaluation.

📝 Abstract

Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning has accordingly been proposed as an extension of classic temporal difference (TD) learning in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$ interactions with the environment to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $1$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors. In addition, we revisit categorical distributional TD (CTD), showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $1$-Wasserstein distance. We also extend our analysis to the more general setting where the data generating process is Markovian. In the Markovian setting, we propose variance-reduced variants of NTD and CTD, and show that both achieve a $\tilde{O}(\varepsilon^{-2}\mu_{\pi,\min}^{-1}(1-\gamma)^{-3}+t_{mix}\mu_{\pi,\min}^{-1}(1-\gamma)^{-1})$ sample complexity bound in the case of the $1$-Wasserstein distance, which matches the state-of-the-art statistical results for classic policy evaluation. To achieve the sharp statistical rates, we establish a novel Freedman's inequality in Hilbert spaces. This new Freedman's inequality may be of independent interest for the statistical analysis of various infinite-dimensional online learning problems.
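
To make the CTD setting concrete, here is a minimal sketch of a categorical distributional TD update with a generative model, together with the $1$-Wasserstein distance used to measure error. It is an illustration under stated assumptions, not the paper's exact algorithm: the abstract does not specify step sizes, the atom grid, or the update schedule, so the sketch assumes deterministic rewards in $[0,1]$, evenly spaced atoms on $[0, 1/(1-\gamma)]$, a constant step size, and one fresh next-state sample per state per iteration. The names `project_categorical`, `wasserstein1`, and `ctd` are hypothetical.

```python
import numpy as np


def project_categorical(atoms, values, probs):
    """Project the discrete distribution sum_i probs[i] * delta(values[i])
    onto an evenly spaced grid `atoms` by splitting each mass linearly
    between its two neighbouring atoms."""
    lo, hi = atoms[0], atoms[-1]
    delta = atoms[1] - atoms[0]
    last = len(atoms) - 1
    pos = (np.clip(values, lo, hi) - lo) / delta
    lower = np.minimum(np.floor(pos).astype(int), last)
    upper = np.minimum(lower + 1, last)
    frac = pos - lower
    out = np.zeros(len(atoms))
    np.add.at(out, lower, probs * (1.0 - frac))
    np.add.at(out, upper, probs * frac)
    return out


def wasserstein1(atoms, p, q):
    """1-Wasserstein distance between two categorical distributions on the
    same evenly spaced grid: the integral of |CDF_p - CDF_q|."""
    delta = atoms[1] - atoms[0]
    return delta * np.abs(np.cumsum(p - q))[:-1].sum()


def ctd(P, r, gamma, num_atoms, num_iters, step_size, rng):
    """Synchronous categorical TD for a fixed policy (assumed setup).

    P[s] is the next-state distribution under the policy, r[s] a
    deterministic reward in [0, 1]. Returns the atom grid and the
    estimated return distribution for every state."""
    n = len(r)
    atoms = np.linspace(0.0, 1.0 / (1.0 - gamma), num_atoms)
    eta = np.full((n, num_atoms), 1.0 / num_atoms)  # uniform initialisation
    for _ in range(num_iters):
        new_eta = np.empty_like(eta)
        for s in range(n):
            # Generative model: draw one next state for each state s.
            s_next = rng.choice(n, p=P[s])
            # Push eta[s_next] through z -> r + gamma * z, then project.
            target = project_categorical(atoms, r[s] + gamma * atoms, eta[s_next])
            new_eta[s] = (1.0 - step_size) * eta[s] + step_size * target
        eta = new_eta
    return atoms, eta
```

Calling `ctd(P, r, gamma, num_atoms, num_iters, step_size, np.random.default_rng(0))` on a small transition matrix `P` and reward vector `r` returns per-state categorical estimates, which can be compared against a reference distribution on the same grid with `wasserstein1`.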

Problem

Research questions and friction points this paper is trying to address.

Distributional Temporal Difference Learning

Markov Decision Processes

Hilbert Space Inequalities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Nonparametric Temporal Difference Learning

Optimal Statistical Efficiency

Hilbert Space Freedman Inequality
