Joint MDPs and Reinforcement Learning in Coupled-Dynamics Environments

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional Markov decision processes (MDPs) specify only the marginal law of one-step outcomes under each action, making it impossible to capture the joint dependencies among counterfactual one-step outcomes across multiple actions. To address this limitation, this work proposes a formal framework termed Joint MDP (JMDP), which introduces a multi-action generation interface to explicitly model the coupled dynamics among actions under shared exogenous randomness. Under a one-step coupling assumption, the authors derive Bellman operators for higher-order moments of returns and develop corresponding dynamic programming and incremental learning algorithms with convergence guarantees. This study establishes a theoretical foundation for modeling cross-action joint distributions in reinforcement learning, enabling estimation of higher-order return moments.
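A minimal Python sketch of what such a multi-action generative interface could look like; all names (`step_all`, `N_ACTIONS`) and the toy transition kernels are illustrative assumptions, not the paper's API:

```python
import random

# Toy sketch of a "multi-action generative interface": one query returns
# counterfactual one-step outcomes for *every* action, all driven by a
# single shared exogenous noise draw (common random numbers).

N_ACTIONS = 2

def step_all(state, rng):
    """Sample a coupled dict {action: (next_state, reward)}."""
    u = rng.random()  # shared exogenous randomness couples the actions
    outcomes = {}
    for a in range(N_ACTIONS):
        # Each action keeps its own marginal transition law, but all
        # actions are evaluated on the same draw u.
        next_state = state + (1 if u < 0.5 + 0.2 * a else -1)
        reward = float(a) - u
        outcomes[a] = (next_state, reward)
    return outcomes

# Standard single-action MDP interaction is recovered by discarding all
# but the chosen action's entry, so marginal observations are unchanged.
joint = step_all(0, random.Random(0))
```

Because every counterfactual outcome is a deterministic function of the same draw `u`, cross-action quantities such as reward gaps acquire a well-defined joint law, which ordinary MDP sampling leaves unspecified.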

📝 Abstract
Many distributional quantities in reinforcement learning are intrinsically joint across actions, including distributions of gaps and probabilities of superiority. However, the classical Markov decision process (MDP) formalism specifies only marginal laws and leaves the joint law of counterfactual one-step outcomes across multiple possible actions at a state unspecified. We study coupled-dynamics environments with a multi-action generative interface which can sample counterfactual one-step outcomes for multiple actions under shared exogenous randomness. We propose joint MDPs (JMDPs) as a formalism for such environments by augmenting an MDP with a multi-action sample transition model which specifies a coupling of one-step counterfactual outcomes, while preserving standard MDP interaction as marginal observations. We adopt and formalize a one-step coupling regime where dependence across actions is confined to immediate counterfactual outcomes at the queried state. In this regime, we derive Bellman operators for $n$th-order return moments, providing dynamic programming and incremental algorithms with convergence guarantees.
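The abstract does not display the moment operators themselves; as a hedged illustration, the fixed-policy Bellman equation for the second return moment $M^{\pi}$, consistent with the classical moment recursion, would read:

$$
M^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s),\; (r,\,s') \sim P(\cdot \mid s, a)}\!\left[\, r^2 + 2\gamma\, r\, V^{\pi}(s') + \gamma^2 M^{\pi}(s') \,\right],
$$

where $V^{\pi}$ is the usual value function; higher-order moments satisfy analogous recursions that mix lower-order ones. This is a standard form, not necessarily the paper's exact operator.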
Problem

Research questions and friction points this paper is trying to address.

joint MDP
coupled dynamics
counterfactual outcomes
reinforcement learning
Markov decision process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint MDPs
coupled dynamics
counterfactual outcomes
multi-action generative model
Bellman operators for moments
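The summary mentions incremental learning algorithms for return moments; a hypothetical tabular sketch of a TD-style update for the first and second moments (the constants, chain dynamics, and exact update rule are assumptions, not the paper's procedure):

```python
import random

# Alongside the usual TD update for the value V, a second table M tracks
# the second moment of the return via the standard recursion
#   M(s) ~ E[ r^2 + 2*gamma*r*V(s') + gamma^2 * M(s') ].

GAMMA, ALPHA = 0.9, 0.1
N_STATES = 3
V = [0.0] * N_STATES
M = [0.0] * N_STATES

def moment_td_update(s, r, s_next):
    """One incremental update of the first and second return moments."""
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
    m_target = r**2 + 2 * GAMMA * r * V[s_next] + GAMMA**2 * M[s_next]
    M[s] += ALPHA * (m_target - M[s])

rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(N_STATES)  # sample a state of the toy cycle
    r = 1.0                      # constant reward for every transition
    s_next = (s + 1) % N_STATES  # deterministic cyclic dynamics
    moment_td_update(s, r, s_next)

# With constant reward 1 and gamma 0.9 the return is 1/(1-0.9) = 10, so
# V should approach 10 and M should approach 100 (zero variance).
```

In the deterministic toy chain the fixed points can be checked by hand: $V = 1 + \gamma V$ gives $V = 10$, and $M = 1 + 2\gamma V + \gamma^2 M$ gives $M = 100$.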