🤖 AI Summary
This work addresses two fundamental challenges in distributional reinforcement learning (DRL) for finite-horizon Markov decision processes (MDPs): the lack of a rigorous theoretical foundation and the intractability of infinite-dimensional distribution representations. We propose the first provably efficient general function approximation framework for DRL. Methodologically, we (1) introduce the notion of Bellman unbiasedness and rigorously prove that moment functions uniquely and completely characterize the statistical properties of return distributions; (2) establish a unified analytical framework integrating functional representation via moments, Eluder dimension to quantify function class complexity, and moment-based least-squares value iteration (SF-LSVI); and (3) derive a tight regret bound of Õ(d_E H^{3/2} √K), where d_E is the Eluder dimension, H the horizon, and K the number of episodes. This is the first result to establish both learnability and computational tractability of DRL under general function approximation.
📝 Abstract
Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. Moreover, the intractability arising from the infinite dimensionality of return distributions has been largely overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite-horizon episodic Markov decision process setting. We first introduce the key notion of $\textit{Bellman unbiasedness}$, which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
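As a minimal illustration of why moment functionals are special (this is a standard probability fact, not the paper's SF-LSVI algorithm): the raw moments of a one-step return $r + Z$ decompose exactly into a finite combination of the moments of $r$ and $Z$ via the binomial expansion, assuming $r$ and the future return $Z$ are independent. No such finite closed-form propagation exists for, e.g., arbitrary quantiles, which is the intuition behind exact distributional updates with moments. The function name and the specific distributions below are illustrative choices, not from the paper.

```python
from math import comb

def moments_of_sum(mom_r, mom_z):
    """Raw moments of r + Z for independent r and Z.

    mom_x[k] = E[x^k], with mom_x[0] = 1. Uses the binomial expansion
    E[(r + Z)^m] = sum_k C(m, k) * E[r^k] * E[Z^(m-k)].
    """
    M = len(mom_r) - 1
    return [sum(comb(m, k) * mom_r[k] * mom_z[m - k] for k in range(m + 1))
            for m in range(M + 1)]

# Example: r ~ Bernoulli(0.5), Z = 1 deterministically,
# so r + Z takes values {1, 2} with probability 1/2 each.
mom_r = [1.0, 0.5, 0.5, 0.5]   # E[r^k] for k = 0..3
mom_z = [1.0, 1.0, 1.0, 1.0]   # E[Z^k] for k = 0..3
print(moments_of_sum(mom_r, mom_z))  # [1.0, 1.5, 2.5, 4.5]
```

The closure of moments under this expansion is what makes a sample-based moment update unbiased: each term is a product of expectations that can be estimated without bias from transitions, in contrast to nonlinear functionals of the distribution.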