Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning

📅 2026-05-11

🤖 AI Summary
In average-reward reinforcement learning, the state-dependent bias is defined only up to an additive constant, which makes direct distributional modeling on the real line ill-posed. This work takes a quotient-space perspective: bias distributions are treated as equivalence classes under a common translation, and a categorical parameterization is proposed that respects this symmetry. On this space, the authors construct a well-defined projected average-reward distributional operator, non-expansive in a coordinate Cramér metric, together with sampling-based recursions. Using the Cramér distance, asynchronous stochastic approximation theory, and Markovian sampling analysis, they show that in an idealized centered-reward setting a one-state temporal-difference update converges almost surely and satisfies finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, coupling the recursion with an online gain estimator preserves non-expansiveness and Markovian convergence, addressing the difficulties that bias indeterminacy causes in the average-reward setting.
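The additive-constant indeterminacy of the bias can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: the 3-state transition matrix `P`, the reward vector `r`, and the helper names `gain_and_bias` and `bellman_residual` are all assumptions. It solves the average-reward evaluation equation h + g·1 = r + P h with the normalization h[0] = 0, then shows that any constant shift of h satisfies the same equation.

```python
import numpy as np

# Hypothetical 3-state Markov reward process under a fixed policy
# (irreducible chain; the numbers are illustrative only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([1.0, 0.0, 2.0])


def gain_and_bias(P, r):
    """Solve h + g*1 = r + P h for (g, h), pinning h[0] = 0."""
    n = len(r)
    # Unknown vector is (h[0], ..., h[n-1], g).
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P   # (I - P) h
    A[:n, n] = 1.0              # + g * 1
    A[n, 0] = 1.0               # normalization row: h[0] = 0
    b = np.append(r, 0.0)
    sol = np.linalg.solve(A, b)
    return sol[n], sol[:n]


def bellman_residual(P, r, g, h):
    """Largest violation of h = r - g + P h over all states."""
    return float(np.max(np.abs(r - g + P @ h - h)))


g, h = gain_and_bias(P, r)
h_shifted = h + 7.5  # any constant shift works, because P @ 1 = 1

print("gain:", g)
print("residual for h:        ", bellman_residual(P, r, g, h))
print("residual for h + 7.5:  ", bellman_residual(P, r, g, h_shifted))
```

Both residuals vanish up to floating-point error, so the equation alone cannot distinguish `h` from `h_shifted`. This is the ambiguity that the quotient by translations is meant to absorb.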
📝 Abstract
Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.
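To make "identified up to a common translation" concrete, the following sketch computes the standard Cramér (l2-of-CDFs) distance between two finitely supported laws and checks that translating both laws by the same constant leaves it unchanged, while translating only one does not. This is a generic illustration under stated assumptions, not the paper's coordinate Cramér metric or its categorical parameterization; the function name `cramer_distance` and the example atoms and probabilities are hypothetical.

```python
import numpy as np


def cramer_distance(atoms_p, probs_p, atoms_q, probs_q):
    """Square root of the integral of (F_p - F_q)^2 for two discrete laws."""
    atoms_p = np.asarray(atoms_p, dtype=float)
    probs_p = np.asarray(probs_p, dtype=float)
    atoms_q = np.asarray(atoms_q, dtype=float)
    probs_q = np.asarray(probs_q, dtype=float)
    # Sorted union of both supports; the CDFs are step functions between these points.
    support = np.union1d(atoms_p, atoms_q)
    cdf_p = np.array([probs_p[atoms_p <= x].sum() for x in support])
    cdf_q = np.array([probs_q[atoms_q <= x].sum() for x in support])
    widths = np.diff(support)
    # Both CDFs are 0 before the first support point and 1 after the last,
    # so only the intervals between consecutive support points contribute.
    return float(np.sqrt(np.sum((cdf_p[:-1] - cdf_q[:-1]) ** 2 * widths)))


# Illustrative laws on a shared 3-atom grid (values are assumptions).
atoms = np.array([-1.0, 0.0, 1.0])
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.3, 0.6])

d_base = cramer_distance(atoms, p, atoms, q)
# Common translation: both laws shifted by the same constant.
d_common = cramer_distance(atoms + 2.5, p, atoms + 2.5, q)
# Translating only one law changes the distance.
d_single = cramer_distance(atoms, p, atoms + 0.5, q)

print(d_base, d_common, d_single)
```

Because the distance is invariant only under a shared shift, a quotient construction has to identify whole state-indexed families of laws up to one common translation rather than shifting each state's law independently, which matches the wording of the abstract.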
Problem

Research questions and friction points this paper is trying to address.

average-reward reinforcement learning
distributional reinforcement learning
bias identifiability
quotient space
Bellman compatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

quotient-space
distributional reinforcement learning
average-reward
categorical representation
non-expansive operator