Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

📅 2026-02-18
🤖 AI Summary
This work addresses a convergence limitation of differential temporal-difference (TD) learning in average-reward reinforcement learning: existing results rely on step-size schedules driven by a local clock of per-state visit counts, which practitioners do not use and which do not extend beyond tabular settings. The paper establishes, for the first time, that on-policy n-step differential TD converges almost surely under standard diminishing step sizes for any n. It further provides three sufficient conditions under which off-policy n-step differential TD converges without any local-clock requirement. By combining stochastic approximation theory, martingale convergence theorems, and structural properties of average-reward Markov decision processes, the paper substantially strengthens the theoretical foundation of differential TD methods, bringing their convergence analysis closer to practical implementations and making it more readily extendable to broader reinforcement learning settings.
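
For context, the one-step differential TD update that these results concern maintains a tabular value estimate $V$ and a scalar average-reward estimate $\bar R$ (the $n$-step variant replaces the single reward with an $n$-step sum); the notation below is illustrative rather than the paper's:

$$
\delta_t = R_{t+1} - \bar R_t + V_t(S_{t+1}) - V_t(S_t), \qquad
V_{t+1}(S_t) = V_t(S_t) + \alpha_t \delta_t, \qquad
\bar R_{t+1} = \bar R_t + \eta \alpha_t \delta_t,
$$

where $\eta > 0$ is a fixed step-size ratio. A local-clock schedule would set $\alpha_t$ from the visit count of $S_t$; the results summarized here instead allow a standard global diminishing sequence such as $\alpha_t = 1/t$.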

📝 Abstract
The average reward is a fundamental performance metric in reinforcement learning (RL) that focuses on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average-reward RL, as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require learning rates governed by a local clock tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
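
A minimal tabular sketch of on-policy $n$-step differential TD in this setting is given below, using a single global $1/t$ step-size schedule with no per-state local clock. The `env_step` sampler, the step-size ratio `eta`, the toy two-state chain, and all identifiers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def differential_td(env_step, n_states, n=1, eta=0.1, num_steps=200_000, seed=0):
    """Tabular on-policy n-step differential TD sketch (illustrative).

    env_step(state, rng) -> (reward, next_state) samples one transition
    under the policy being evaluated.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)     # differential value estimates
    r_bar = 0.0                # average-reward estimate
    states, rewards = [0], []  # sliding window of the last n transitions
    for t in range(1, num_steps + 1):
        r, s_next = env_step(states[-1], rng)
        rewards.append(r)
        states.append(s_next)
        if len(rewards) == n:
            # n-step differential TD error:
            # delta = sum_k (R_k - r_bar) + V(S_{t+n}) - V(S_t)
            delta = sum(rewards) - n * r_bar + V[states[-1]] - V[states[0]]
            alpha = 1.0 / t    # global diminishing step size: no "local clock"
            V[states[0]] += alpha * delta
            r_bar += eta * alpha * delta
            states.pop(0)
            rewards.pop(0)
    return V, r_bar

if __name__ == "__main__":
    # Toy 2-state Markov chain under a fixed policy; true average reward is 2/3.
    P = np.array([[0.9, 0.1], [0.2, 0.8]])
    R = np.array([1.0, 0.0])  # reward received in the current state

    def step(s, rng):
        return R[s], rng.choice(2, p=P[s])

    V, r_bar = differential_td(step, n_states=2, n=3)
    print("estimated average reward:", r_bar)
```

The $1/t$ schedule satisfies the usual Robbins-Monro conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), which is the kind of standard diminishing sequence the paper's on-policy result covers.
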
Problem

Research questions and friction points this paper is trying to address.

average reward
differential temporal difference learning
almost sure convergence
Markov decision processes
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

differential temporal difference learning
average reward MDP
almost sure convergence
n-step TD
off-policy learning