Reinforcement learning with non-ergodic reward increments: robustness via ergodicity transformations

📅 2023-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning, non-ergodic reward dynamics, particularly those with heavy-tailed return distributions, can destabilize single-trajectory policy optimization and lead to catastrophic failure. To address this, the paper proposes a data-driven ergodicity transformation that enables robust, trajectory-level optimization under non-ergodic reward settings. The approach learns a mapping that turns raw reward sequences into a form whose increments are ergodic, so that the expected value of an increment coincides with its average over a single long trajectory. The pipeline combines time-series modeling, resampling-based estimation, and policy-gradient optimization. Experiments on a heavy-tailed return task and standard RL benchmarks show that the method significantly reduces the probability of catastrophic failure, lowers single-trajectory performance variance by over 60%, and substantially improves long-term policy robustness.
📝 Abstract
Envisioned application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, which all require RL agents to make decisions in the real world. A significant challenge hindering the adoption of RL methods in these domains is the non-robustness of conventional algorithms. In particular, the focus of RL is typically on the expected value of the return. The expected value is the average over the statistical ensemble of infinitely many trajectories, which can be uninformative about the performance of the average individual. For instance, when we have a heavy-tailed return distribution, the ensemble average can be dominated by rare extreme events. Consequently, optimizing the expected value can lead to policies that yield exceptionally high returns with a probability that approaches zero but almost surely result in catastrophic outcomes in single long trajectories. In this paper, we develop an algorithm that lets RL agents optimize the long-term performance of individual trajectories. The algorithm enables the agents to learn robust policies, which we show in an instructive example with a heavy-tailed return distribution and standard RL benchmarks. The key element of the algorithm is a transformation that we learn from data. This transformation turns the time series of collected returns into one for whose increments expected value and the average over a long trajectory coincide. Optimizing these increments results in robust policies.
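The abstract's central idea, that the right transformation makes increments ergodic so their expected value matches the average over one long trajectory, can be illustrated with a minimal sketch. This toy multiplicative process is a hypothetical example chosen for illustration (it is not the paper's benchmark, and the paper learns its transformation from data rather than fixing it analytically); for multiplicative dynamics the logarithm is the known ergodicity transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multiplicative reward dynamics (hypothetical, for illustration):
# each step multiplies the accumulated return by 1.5 or 0.6 with
# equal probability.  Shape: (trajectories, time steps).
factors = rng.choice([1.5, 0.6], size=(10_000, 1_000))

# Ensemble average of the raw factor looks attractive:
# E[factor] = 0.5 * (1.5 + 0.6) = 1.05 > 1.
ensemble_growth = factors.mean()

# Yet a single long trajectory almost surely decays, because the
# time-average growth factor is sqrt(1.5 * 0.6) = sqrt(0.9) < 1.
time_avg_growth = np.exp(np.log(factors[0]).mean())

# Ergodicity transformation for multiplicative dynamics: take logs.
# The increments of the transformed series are ergodic, so their
# expected value equals their single-trajectory time average
# (both approach 0.5 * ln(0.9), which is negative).
increments = np.log(factors)
expected_increment = increments.mean()
single_trajectory_increment = np.log(factors[0]).mean()
```

An agent optimizing the ensemble expectation of the raw process would accept this gamble; optimizing the expected transformed increment correctly rejects it, which is the robustness effect the paper targets.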
Problem

Research questions and friction points the paper addresses.

Reinforcement Learning
Reward Distribution
Robust Decision Policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Reward Stabilization
Robust Decision-Making