AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: fixed discount factors cannot adapt to state-dependent temporal dynamics, and naive state-dependent discounting in deep actor-critic methods tends to destabilize training and degenerate toward TD-error collapse. The paper introduces AdaGamma, a practical state-dependent discounting method for deep actor-critic frameworks. AdaGamma learns a state-dependent discount function together with a return-consistency objective that regularizes the induced backup structure and prevents degenerate target manipulation. Integrated into SAC and PPO, it yields consistent improvements on continuous-control benchmarks, and an online A/B test on the JD Logistics platform shows statistically significant gains.

📝 Abstract

The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
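
To make the idea of a Bellman operator with state-dependent discounting concrete, here is a minimal tabular sketch. It is an illustration under stated assumptions, not the paper's method. It assumes a fixed policy (so the dynamics reduce to a transition matrix `P`), a discount `gamma(s)` evaluated at the current state, and a uniform bound `max gamma(s) < 1`, which is one simple condition under which the backup is a sup-norm contraction. The toy chain, the names `bellman_backup` and `evaluate`, and the discount values are all hypothetical. The learned discount network and the return-consistency objective are not implemented here, since their exact form is not given in this summary.

```python
import numpy as np

def bellman_backup(V, r, P, gamma):
    """Apply (T V)(s) = r(s) + gamma(s) * sum_{s'} P[s, s'] * V(s') once.

    Assumption: the discount depends on the current state s only.
    """
    return r + gamma * (P @ V)

def evaluate(r, P, gamma, tol=1e-10, max_iters=10_000):
    """Iterate the backup from V = 0 until successive iterates differ by < tol."""
    V = np.zeros_like(r, dtype=float)
    for _ in range(max_iters):
        V_next = bellman_backup(V, r, P, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V

# Hypothetical 3-state chain under a fixed policy (rows of P sum to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 1.0, 0.0])
gamma = np.array([0.99, 0.90, 0.50])  # per-state discounts, all below 1

V = evaluate(r, P, gamma)

# Closed-form fixed point: V = (I - diag(gamma) P)^{-1} r.
V_exact = np.linalg.solve(np.eye(3) - gamma[:, None] * P, r)

# Empirical contraction check on two random value vectors.
rng = np.random.default_rng(0)
V1, V2 = rng.normal(size=3), rng.normal(size=3)
lhs = np.max(np.abs(bellman_backup(V1, r, P, gamma) - bellman_backup(V2, r, P, gamma)))
rhs = np.max(gamma) * np.max(np.abs(V1 - V2))
```

With these assumptions, `V` matches `V_exact` up to the iteration tolerance, and `lhs <= rhs` holds because each row of `P` is a probability distribution. If `gamma` were instead free to shrink toward zero wherever that lowers the TD error, the target would reduce to the immediate reward, which is one way to read the "degenerate target manipulation" the abstract warns about.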

Problem

Research questions and friction points this paper is trying to address.

state-dependent discounting

reinforcement learning

deep actor-critic

TD-error collapse

temporal adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

state-dependent discounting

return-consistency regularization

AdaGamma