Model-Based Reinforcement Learning Under Confounding

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In contextual Markov decision processes (C-MDPs) whose context is unobserved, the hidden context confounds the offline data, and conventional model-based reinforcement learning is fundamentally inconsistent due to the mismatch between the behavior policy and the intervention target. To address this, we propose a causally consistent model-learning and planning framework: (i) proximal offline policy evaluation via proxy variables; (ii) construction of a behavior-averaged transition model that defines an identifiable surrogate MDP; and (iii) joint modeling and optimization grounded in the maximum causal entropy principle. Our approach is the first to enable unbiased state-policy modeling and causally consistent Bellman iteration without observing the confounding context. It yields consistent estimators for both the reward and transition functions, substantially improving the stability and reliability of policy evaluation and planning under unmeasured confounding.
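As a rough, hypothetical illustration of step (ii), the tabular sketch below estimates a behavior-averaged transition table from offline (s, a, s') tuples. The function name and the tabular setting are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def behavior_averaged_model(transitions, n_states, n_actions):
    """Estimate a behavior-averaged transition table P_bar(s' | s, a)
    from offline (s, a, s') tuples.

    The unobserved context never appears here: empirical (s, a) -> s'
    frequencies already marginalize over the behavior policy's
    context-dependent visitation, which is what makes the resulting
    surrogate MDP identifiable from observational data alone.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform next-state distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / n_states)
```

Pairing such a table with a proximally identified reward (step (i)) would yield the surrogate MDP on which planning runs.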

📝 Abstract
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
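To make the "well defined and consistent" Bellman operator concrete, here is one plausible form under assumed notation (not taken verbatim from the paper): $\bar{r}$ denotes the proximally identified reward, $\bar{P}$ the behavior-averaged transition model, and $\pi$ a state-based target policy.

```latex
% Assumed notation for illustration; not verbatim from the paper.
% \bar{r}: proximally identified reward, \bar{P}: behavior-averaged
% transitions, \pi: state-based target policy, \gamma: discount factor.
\[
  (\mathcal{T}^{\pi} V)(s)
  = \sum_{a} \pi(a \mid s)
    \Big[ \bar{r}(s, a) + \gamma \sum_{s'} \bar{P}(s' \mid s, a)\, V(s') \Big]
\]
```

Because both $\bar{r}$ and $\bar{P}$ are identified from observable trajectories, fixed-point iteration on this operator never references the hidden context.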
Problem

Research questions and friction points this paper is trying to address.

Addresses confounding in offline model-based reinforcement learning with unobserved contexts (illustrated by the sketch after this list)
Proposes a surrogate MDP using proximal evaluation for consistent policy assessment
Enables principled learning and planning in confounded environments without observed context
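To make the first friction point concrete, the following is a small hypothetical simulation (not from the paper) of a one-step confounded problem: the behavior policy peeks at an unobserved binary context, so the naive observational reward estimate is badly biased relative to the interventional quantity a state-based policy needs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounded one-step problem: unobserved context c ~ Bernoulli(0.5),
# reward r = 1 iff the action matches the context, so E[r | do(a)] = 0.5 for both actions.
c = rng.integers(0, 2, size=n)
# The behavior policy observes c and picks the matching action 90% of the time.
a = np.where(rng.random(n) < 0.9, c, 1 - c)
r = (a == c).astype(float)

for action in (0, 1):
    naive = r[a == action].mean()
    print(f"naive E[r | a={action}] = {naive:.3f}   interventional E[r | do(a)] = 0.500")
```

The naive conditional estimate comes out near 0.9 because conditioning on the action also conditions on the hidden context; this is exactly the inconsistency the surrogate-MDP construction is designed to remove.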
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proximal off-policy evaluation for confounded reward identification
Behavior-averaged transition model for surrogate MDP construction
Integration with MaxCausalEnt framework for principled model learning (see the sketch after this list)
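As a hedged sketch of how the pieces might compose at planning time (function names and the tabular setting are assumptions, not the paper's implementation), soft value iteration on the surrogate MDP recovers a MaxCausalEnt-style state policy:

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P_bar, r_bar, gamma=0.95, n_iters=500):
    """Soft (maximum-causal-entropy) value iteration on the surrogate MDP.

    Hypothetical sketch: P_bar has shape [S, A, S] (behavior-averaged
    transitions) and r_bar has shape [S, A] (proximally identified rewards).
    """
    n_states, n_actions, _ = P_bar.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Soft Bellman backup: Q(s, a) = r_bar(s, a) + gamma * E_{s'}[V(s')].
        Q = r_bar + gamma * P_bar @ V
        # Soft maximum (log-sum-exp over actions) replaces the hard max.
        V = logsumexp(Q, axis=1)
    # MaxCausalEnt-style state policy: softmax over the soft Q-values.
    policy = np.exp(Q - V[:, None])
    return V, policy
```

Here `P_bar` could come from the tabular estimator sketched earlier and `r_bar` from the proximal reward-identification step.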
Nishanth Venkatesh
Department of Systems Engineering, Cornell University, Ithaca, NY 14850 USA
Andreas A. Malikopoulos
Professor, Cornell University
Decentralized control, learning-based control, cyber-physical systems, emerging mobility systems