🤖 AI Summary
To address the computational and sample-efficiency bottlenecks of long-history modeling in partially observable reinforcement learning (PORL), this paper introduces *memory traces*—compact representations of the observation history in the form of exponential moving averages, replacing conventional finite-window histories. Adapting the eligibility-trace concept to history compression in partially observable Markov decision processes (POMDPs), the paper proves sample complexity bounds for offline on-policy evaluation with Lipschitz-continuous value estimates, establishes a close connection to the window approach, and shows that in certain environments learning with memory traces is significantly more sample efficient. Online reinforcement learning experiments demonstrate faster convergence and stronger generalization in both value prediction and control tasks, particularly in environments with long-range dependencies. The core contribution is a lightweight, theoretically grounded, and empirically effective history-representation paradigm that bridges interpretability and practicality.
📝 Abstract
Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with a finite window of observations quickly becomes intractable as the window length grows. In this work, we introduce memory traces. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the value errors achieved with memory traces for the class of Lipschitz continuous value estimates. We establish a close connection to the window approach, and demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control.
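The exponential-moving-average idea behind memory traces can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact formulation: the decay rate `lam`, the one-hot observation featurisation, and the recurrence `z ← lam·z + (1−lam)·φ(o)` are assumptions chosen to show how a trace compresses an unbounded history into a fixed-size vector, in contrast to a finite window that truncates it.

```python
# Hedged sketch of a memory trace: an exponential moving average of
# (here, one-hot) observation features. All names and the exact update
# rule are illustrative assumptions, not the paper's definitions.

def one_hot(obs: int, n_obs: int) -> list[float]:
    """Illustrative observation featurisation phi(o)."""
    v = [0.0] * n_obs
    v[obs] = 1.0
    return v

def update_trace(trace: list[float], obs: int, lam: float) -> list[float]:
    """One assumed trace update: z' = lam * z + (1 - lam) * phi(o)."""
    phi = one_hot(obs, len(trace))
    return [lam * z + (1.0 - lam) * p for z, p in zip(trace, phi)]

# Roll the trace over an observation sequence: recent observations
# dominate while older ones decay geometrically, instead of being
# dropped outright as in a length-k window.
trace = [0.0] * 3
for o in [0, 1, 2, 1]:
    trace = update_trace(trace, o, lam=0.5)
print(trace)  # → [0.0625, 0.625, 0.25]
```

A value estimate can then be learned as a function of the fixed-size `trace` vector rather than of a growing window of raw observations, which is what makes the representation's size independent of history length.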