🤖 AI Summary
Existing finite-sample theory largely does not cover asynchronous categorical temporal-difference (categorical TD) learning, which updates a single state per iteration and, in online settings, is driven by a Markovian trajectory. This work addresses the gap by recasting the algorithm, via an isometric embedding, as an asynchronous stochastic-approximation recursion that contracts in a statewise supremum norm. It establishes non-asymptotic error bounds for asynchronous categorical TD in both the Cramér geometry (scalar case) and the maximum mean discrepancy geometry (multivariate signed-categorical case). The analysis covers discounted problems under both i.i.d. and Markovian state sampling, as well as undiscounted fixed-horizon problems under i.i.d. episodic sampling. These finite-iteration guarantees bring the theory closer to the categorical recursions used in practical distributional reinforcement learning implementations.
📝 Abstract
Recent non-asymptotic analyses have substantially advanced the theory of distributional policy evaluation, but they largely concern synchronous full-state updates under a generative model, model-based estimators, accelerated variants, or different approximation architectures. Standard categorical temporal-difference learning is typically used in a different regime. It asynchronously performs a single-state update at each iteration and, in online settings, is driven by a Markovian trajectory. This leaves an important gap between existing finite-iteration theory and the categorical recursions most closely aligned with practical distributional temporal-difference implementations. We bridge this gap for two categorical policy-evaluation methods: scalar categorical temporal-difference learning in the Cramér geometry and multivariate signed-categorical temporal-difference learning in the maximum mean discrepancy geometry. After suitable isometric embeddings, both algorithms take the form of asynchronous single-state stochastic-approximation recursions that contract in a statewise supremum norm. This permits finite-iteration guarantees in discounted problems under both i.i.d. and Markovian state sampling, and in undiscounted fixed-horizon problems under i.i.d. episodic sampling.
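To make the scalar recursion concrete, the following is a minimal Python sketch of one asynchronous single-state categorical TD update. It assumes a tabular setting, an evenly spaced support with at least two atoms, the commonly used linear-interpolation projection onto that support (the projection associated with the Cramér geometry), and a constant step size. The names `project_onto_grid`, `categorical_td_step`, `atoms`, `gamma`, and `alpha` are illustrative and are not taken from the paper. The multivariate signed-categorical variant is not covered here.

```python
def project_onto_grid(locations, masses, atoms):
    """Project a discrete distribution onto an evenly spaced grid.

    Each point mass is clipped to [atoms[0], atoms[-1]] and then split
    between its two neighbouring atoms by linear interpolation.
    Assumes len(atoms) >= 2 and evenly spaced, increasing atoms.
    """
    m = len(atoms)
    z_min, z_max = atoms[0], atoms[-1]
    dz = (z_max - z_min) / (m - 1)
    out = [0.0] * m
    for y, w in zip(locations, masses):
        y = min(max(y, z_min), z_max)      # clip to the support range
        pos = (y - z_min) / dz             # fractional grid index, >= 0
        lo = min(int(pos), m - 2)          # left neighbour (int floors here)
        frac = pos - lo                    # share sent to the right neighbour
        out[lo] += w * (1.0 - frac)
        out[lo + 1] += w * frac
    return out


def categorical_td_step(probs, s, r, s_next, atoms, gamma, alpha):
    """One asynchronous update: only the entry for state `s` changes.

    probs maps each state to a list of probabilities over `atoms`.
    The target is the next-state distribution pushed through
    z -> r + gamma * z and projected back onto the grid.
    """
    target_locations = [r + gamma * z for z in atoms]
    target = project_onto_grid(target_locations, probs[s_next], atoms)
    probs[s] = [(1.0 - alpha) * p + alpha * t
                for p, t in zip(probs[s], target)]
    return probs
```

In an online run, `categorical_td_step` would be called once per observed transition `(s, r, s_next)` along the trajectory, so the visited state is determined by the Markov chain rather than chosen by the learner. Under i.i.d. sampling, `s` would instead be drawn independently at each iteration.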