🤖 AI Summary
This work addresses two fundamental challenges in distributional reinforcement learning (DRL) for finite-horizon Markov decision processes (MDPs): the lack of a rigorous theoretical foundation and the intractability of infinite-dimensional distribution representations. We propose the first provably efficient general function approximation framework for DRL. Methodologically, we (1) introduce the notion of Bellman unbiasedness and rigorously prove that moment functions uniquely and completely characterize the statistical properties of return distributions; (2) establish a unified analytical framework integrating functional representation via moments, Eluder dimension to quantify function class complexity, and moment-based least-squares value iteration (SF-LSVI); and (3) derive a tight regret bound of Õ(d_E H^{3/2} √K), where d_E is the Eluder dimension, H the horizon, and K the number of episodes. This is the first result to establish both learnability and computational tractability of DRL under general function approximation.
📝 Abstract
Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. Moreover, the intractability arising from the infinite dimensionality of return distributions has been largely overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite-horizon episodic Markov decision process setting. We first introduce the key notion of $\textit{Bellman unbiasedness}$, which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
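As a minimal illustration of why moment functionals are special (this is a standard probability fact, not the paper's SF-LSVI algorithm): the raw moments of a one-step return $r + Z$ decompose exactly into a finite combination of the moments of $r$ and $Z$ via the binomial expansion, assuming $r$ and the future return $Z$ are independent. No such finite closed-form propagation exists for, e.g., arbitrary quantiles, which is the intuition behind exact distributional updates with moments. The function name and the specific distributions below are illustrative choices, not from the paper.

```python
from math import comb

def moments_of_sum(mom_r, mom_z):
    """Raw moments of r + Z for independent r and Z.

    mom_x[k] = E[x^k], with mom_x[0] = 1. Uses the binomial expansion
    E[(r + Z)^m] = sum_k C(m, k) * E[r^k] * E[Z^(m-k)].
    """
    M = len(mom_r) - 1
    return [sum(comb(m, k) * mom_r[k] * mom_z[m - k] for k in range(m + 1))
            for m in range(M + 1)]

# Example: r ~ Bernoulli(0.5), Z = 1 deterministically,
# so r + Z takes values {1, 2} with probability 1/2 each.
mom_r = [1.0, 0.5, 0.5, 0.5]   # E[r^k] for k = 0..3
mom_z = [1.0, 1.0, 1.0, 1.0]   # E[Z^k] for k = 0..3
print(moments_of_sum(mom_r, mom_z))  # [1.0, 1.5, 2.5, 4.5]
```

The closure of moments under this expansion is what makes a sample-based moment update unbiased: each term is a product of expectations that can be estimated without bias from transitions, in contrast to nonlinear functionals of the distribution.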