🤖 AI Summary
This work studies reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history, so methods that assume a Markovian state do not directly apply. The authors propose a reward-centric Agent State-Markov (ASM) policy framework in which the agent recursively updates an internal state and jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. Within this framework, they establish a policy gradient theorem that extends the classical Markovian results to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, they develop the Agent State-Markov Policy Gradient (ASMPG) algorithm and establish finite-time and almost sure convergence guarantees. Empirically, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
📝 Abstract
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn them via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, each comprising agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
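To make the structure of an ASM policy concrete, the following is a minimal sketch of the two components the abstract describes: a recursive agent state update and a control policy that maps the agent state to action probabilities. Everything specific here is an assumption for illustration and is not taken from the paper: the class name `ASMPolicy`, the tanh update, the softmax policy, the random initialization, and the toy cue-recall environment in `rollout`. The sketch shows only the forward rollout. It does not implement the ASMPG gradient estimator, whose form the abstract does not state.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Plain-Python matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def one_hot(index, size):
    """One-hot list; index=None yields all zeros (no previous action)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


class ASMPolicy:
    """Hypothetical ASM policy: agent state dynamics plus a control policy."""

    def __init__(self, state_dim, n_obs, n_actions, rng):
        self.state_dim, self.n_obs, self.n_actions = state_dim, n_obs, n_actions
        in_dim = state_dim + n_obs + n_actions
        # Parameters of the agent state dynamics (assumed tanh recurrence).
        self.W = [[rng.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(state_dim)]
        # Parameters of the control policy (assumed softmax over actions).
        self.V = [[rng.gauss(0, 0.5) for _ in range(state_dim)] for _ in range(n_actions)]

    def initial_state(self):
        return [0.0] * self.state_dim

    def update_state(self, z, prev_action, obs):
        """Recursive update: new state from old state, last action, new observation."""
        x = z + one_hot(obs, self.n_obs) + one_hot(prev_action, self.n_actions)
        return [math.tanh(v) for v in matvec(self.W, x)]

    def action_probs(self, z):
        """Control policy: depends on the history only through the agent state z."""
        return softmax(matvec(self.V, z))


def rollout(policy, horizon, gamma, rng):
    """Toy non-Markovian task (assumed): a cue is observed only at t=0, and the
    reward at every step is 1 if the action matches that initial cue."""
    cue = rng.randrange(2)
    obs, prev_action = cue, None
    z = policy.initial_state()
    ret = 0.0
    for t in range(horizon):
        z = policy.update_state(z, prev_action, obs)
        probs = policy.action_probs(z)
        action = rng.choices(range(policy.n_actions), weights=probs)[0]
        reward = 1.0 if action == cue else 0.0
        ret += (gamma ** t) * reward
        prev_action, obs = action, 2  # observation 2 is a blank symbol after t=0
    return ret, z


rng = random.Random(0)
policy = ASMPolicy(state_dim=4, n_obs=3, n_actions=2, rng=rng)
discounted_return, final_state = rollout(policy, horizon=5, gamma=0.9, rng=rng)
```

In this toy task the current observation alone carries no information about the reward after the first step, so the agent state must retain the cue. A reward-centric approach in the sense of the abstract would adjust both `W` and `V` to increase the discounted return, rather than training `W` separately to predict future observations.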