AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses two problems: fixed discount factors cannot adapt to state-dependent temporal dynamics, and existing state-dependent discounting methods often cause training instability and TD-error collapse. To overcome these challenges, the paper introduces AdaGamma, a state-dependent discounting mechanism designed to train stably within deep actor-critic frameworks. AdaGamma learns a state-dependent discount function jointly with a return-consistency objective, which regularizes the value backup structure to prevent degenerate target manipulation. Empirical evaluations built on SAC and PPO show consistent performance improvements across continuous-control benchmarks. In addition, an online A/B test on JD Logistics' real-world platform shows statistically significant gains, supporting the practical efficacy of the approach.
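To make "TD-error collapse" concrete, here is a minimal numerical sketch of one way a learned discount can trivially zero out the TD error. It is an illustration under stated assumptions, not the paper's algorithm: the function name `td_errors`, the toy numbers, and the convention that the discount is evaluated at the current state (target `r + gamma(s) * V(s')`) are all hypothetical choices made for this example.

```python
import numpy as np

def td_errors(rewards, values, next_values, gammas, dones):
    """One-step TD errors with a per-transition discount.

    Assumed convention (not taken from the paper): the discount is
    evaluated at the current state s, so the target is
    r + gamma(s) * V(s'), with bootstrapping switched off at terminals.
    """
    targets = rewards + gammas * (1.0 - dones) * next_values
    return targets - values

# Toy batch of three transitions (made-up numbers).
rewards = np.array([1.0, 0.5, 2.0])
next_values = np.array([10.0, 8.0, 12.0])
dones = np.zeros(3)

# Fixed discount with a critic that does not yet match its targets:
# the TD errors are non-zero and carry a learning signal.
fixed = td_errors(rewards, np.full(3, 9.0), next_values,
                  np.full(3, 0.99), dones)

# Degenerate solution: gamma(s) = 0 everywhere and V(s) = r.
# The TD error vanishes even though the critic ignores the future.
collapsed = td_errors(rewards, rewards.copy(), next_values,
                      np.zeros(3), dones)
```

If the discount is trained only to reduce TD error, nothing in this toy setup rules out the second configuration. The summary suggests that the return-consistency objective is there to prevent this kind of target manipulation.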
πŸ“ Abstract
The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
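As a rough guide to what "well-posedness" can mean here, the block below writes out one plausible form of the Bellman operator with a state-dependent discount, together with a standard sufficient condition for it to be a contraction. The notation, the choice to evaluate the discount at the current state, and the uniform bound $\bar{\gamma}$ are assumptions made for this sketch; the paper's exact operator and conditions may differ.

```latex
% Assumed policy-evaluation operator with discount \gamma : \mathcal{S} \to [0, 1]
(\mathcal{T}^{\pi}_{\gamma} V)(s)
  = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s, a)}
    \bigl[\, r(s, a) + \gamma(s)\, V(s') \,\bigr]

% If rewards are bounded and \sup_{s} \gamma(s) \le \bar{\gamma} < 1, then for any bounded V, U:
\bigl| (\mathcal{T}^{\pi}_{\gamma} V)(s) - (\mathcal{T}^{\pi}_{\gamma} U)(s) \bigr|
  = \gamma(s)\, \bigl| \mathbb{E}\bigl[ V(s') - U(s') \bigr] \bigr|
  \le \bar{\gamma}\, \lVert V - U \rVert_{\infty}

% Taking the supremum over s gives a \bar{\gamma}-contraction in the sup norm,
% hence a unique fixed point by the Banach fixed-point theorem.
\lVert \mathcal{T}^{\pi}_{\gamma} V - \mathcal{T}^{\pi}_{\gamma} U \rVert_{\infty}
  \le \bar{\gamma}\, \lVert V - U \rVert_{\infty}
```

Under this assumed form, a fixed discount is the special case $\gamma(s) \equiv \bar{\gamma}$, and the argument reduces to the usual contraction proof for policy evaluation.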
Problem

Research questions and friction points this paper is trying to address.

state-dependent discounting
reinforcement learning
deep actor-critic
TD-error collapse
temporal adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

state-dependent discounting
return-consistency regularization
AdaGamma
deep actor-critic
Bellman operator