🤖 AI Summary
In real-world environments, raw observations often violate the Markov property. Given a longitudinal causal directed acyclic graph (DAG) over the observed variables, this work gives a procedure for constructing a provably minimal Markovian state representation. The authors also propose the Multi-Order State Exposure (MOSE) mechanism, which feeds state constructions of several history orders into a single Q-function. Empirically, the minimal representation alone does not improve deep RL performance; controlled redundancy is needed to exploit the underlying causal structure. On common benchmarks and synthetic datasets, MOSE consistently outperforms both single-window policies and the minimal state construction, and including the minimal representation alongside MOSE can improve performance further.
📝 Abstract
Online reinforcement learning (RL) relies on the Markov property for guaranteed performance, but real-world applications often lack well-defined states given raw observed variables. While causal RL has attracted growing interest, existing work typically assumes Markovian states are provided and focuses on using causality to accelerate learning, leaving a fundamental gap: \emph{given a longitudinal causal graph over observed variables, how does one construct MDP states that provably satisfy the Markov property?} We address this by providing a procedure that constructs a provably minimal state representation. In deep RL, we observe that the minimal representation alone empirically fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. To address this, we propose \textbf{MOSE} (Multi-Order State Exposure), which feeds multi-order historical state constructions into the same $Q$-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. Our results establish a core principle for causal deep RL: minimal sufficiency is not enough, and \emph{controlled redundancy} is necessary to unlock the benefit of causal state information.
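To make the idea of multi-order state exposure concrete, here is a minimal sketch of one possible reading of it. It is not the authors' implementation. Several things are assumptions made purely for illustration, because the abstract does not specify them: the names `window_state`, `mose_inputs`, and `q_values`; the zero-padding to a common width; the one-hot order indicator; and the linear Q-function. The sketch builds one history window per order and scores every window with the same set of Q-function weights.

```python
def window_state(history, k):
    """Flatten the last k observations, zero-padding at the front if fewer exist."""
    obs_dim = len(history[0])
    recent = history[-k:]
    missing = k - len(recent)
    flat = [0.0] * (missing * obs_dim)
    for obs in recent:
        flat.extend(obs)
    return flat


def mose_inputs(history, orders):
    """Build one fixed-width input row per history order (illustrative assumption)."""
    obs_dim = len(history[0])
    width = max(orders) * obs_dim
    rows = []
    for i, k in enumerate(orders):
        state = window_state(history, k)
        # Right-align each window so the most recent observation sits in the same slots.
        padded = [0.0] * (width - len(state)) + state
        # One-hot tag telling the shared Q-function which order this row encodes.
        one_hot = [1.0 if j == i else 0.0 for j in range(len(orders))]
        rows.append(padded + one_hot)
    return rows


def q_values(weights, row):
    """Hypothetical linear Q-function: one weight vector per action, shared across orders."""
    return [sum(w * x for w, x in zip(w_a, row)) for w_a in weights]


# Toy history of three 2-dimensional observations, exposed at orders 1, 2, and 4.
history = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
orders = [1, 2, 4]
inputs = mose_inputs(history, orders)

# Two actions, arbitrary placeholder weights.
weights = [[0.01] * len(inputs[0]), [-0.01] * len(inputs[0])]
q_per_order = [q_values(weights, row) for row in inputs]
```

How the per-order Q-values are combined during training or action selection is deliberately left open here, since the abstract does not say.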