A Differential Perspective on Distributional Reinforcement Learning

📅 2025-06-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work extends distributional reinforcement learning (DRL) to the average-reward setting, addressing the challenge of modeling both the long-run per-step reward distribution and the differential return distribution. The authors propose the first theoretically convergent distributional RL framework for average-reward MDPs, explicitly defining and learning the distribution of differential returns, thereby overcoming fundamental limitations of discounted formulations. The method combines quantile-based distribution representations with differential value function theory, yielding tabular algorithms with convergence guarantees that generalize to a scalable family of approximate algorithms. Empirical evaluation across multiple benchmark tasks shows competitive performance against non-distributional baselines while accurately capturing reward distribution characteristics, validating both the theoretical convergence guarantees and the practical efficacy of the framework.

๐Ÿ“ Abstract
To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a potentially-discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time-step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms consistently yield competitive performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run reward and return distributions.
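The abstract describes a quantile-based approach to learning the differential return distribution of an average-reward MDP. The following is a minimal sketch of how such a prediction algorithm might look, not the paper's exact method: quantile estimates of the differential return are updated by quantile regression against targets shifted by `r - rho`, while `rho` tracks the average reward via a differential TD error. The environment interface `sample_transition` and the two-state chain are hypothetical illustrations.

```python
import numpy as np

def differential_qtd_prediction(sample_transition, n_states, n_quantiles=8,
                                alpha=0.05, beta=0.01, steps=20000, seed=0):
    """Sketch of quantile-based differential TD prediction (assumed form).

    theta[s, i] approximates the tau_i-quantile of the differential return
    distribution at state s; rho tracks the long-run average reward.
    """
    rng = np.random.default_rng(seed)
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles  # quantile midpoints
    theta = np.zeros((n_states, n_quantiles))
    rho = 0.0
    s = 0
    for _ in range(steps):
        r, s_next = sample_transition(s, rng)
        # Differential targets: bootstrapped quantiles shifted by r - rho.
        targets = r - rho + theta[s_next]
        for i in range(n_quantiles):
            # Expected quantile-regression gradient step toward tau_i.
            indicator = (targets < theta[s, i]).mean()
            theta[s, i] += alpha * (taus[i] - indicator)
        # Slow stochastic update of the average-reward estimate
        # using the differential TD error on the quantile means.
        rho += beta * (r + theta[s_next].mean() - theta[s].mean() - rho)
        s = s_next
    return theta, rho

def two_state(s, rng):
    # Hypothetical 2-state chain with noisy rewards; average reward is 2.0.
    if s == 0:
        return rng.normal(1.0, 0.5), 1
    return rng.normal(3.0, 0.5), 0

theta, rho = differential_qtd_prediction(two_state, n_states=2)
```

On this toy chain, `rho` should settle near the true average reward of 2.0, while the learned quantiles spread out to reflect the reward noise; the paper's actual algorithms additionally cover control and carry formal convergence guarantees.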
Problem

Research questions and friction points this paper is trying to address.

Distributional RL methods have so far been confined to the discounted setting
How to define and learn the long-run per-step reward and differential return distributions of an average-reward MDP
Lack of convergence-guaranteed distributional algorithms for average-reward prediction and control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends distributional RL to average-reward setting
Uses a quantile-based representation of reward and return distributions
Develops convergent tabular and scalable algorithms