Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

📅 2026-04-06
🤖 AI Summary
This work addresses biased policy evaluation in offline, confounded partially observable Markov decision processes (POMDPs), where latent factors may jointly affect actions, rewards, and future observations, so models learned directly from observational data may be biased. The authors formulate bridge function learning as a conditional moment restriction (CMR) problem whose nuisance objects are a conditional mean embedding and a conditional density, and they extend the existing two-stage bridge estimator with $K$-fold cross-fitting. The procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. The paper also derives an oracle-comparator bound for the cross-fitted estimator and decomposes the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
📝 Abstract
Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a $K$-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
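The $K$-fold cross-fitting pattern mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a linear bridge function, uses a ridge regression of W on Z as a stand-in for the Stage I conditional mean embedding, and omits the conditional density nuisance entirely. The names `ridge_fit` and `cross_fitted_two_stage`, the variables `Z`, `W`, `Y`, and the regularization parameters are all hypothetical.

```python
import numpy as np


def ridge_fit(X, Y, lam):
    """Ridge coefficients B minimizing ||Y - X B||^2 + lam * ||B||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


def cross_fitted_two_stage(Z, W, Y, n_folds=5, lam1=1e-3, lam2=1e-3, seed=0):
    """Toy cross-fitted two-stage estimator for E[Y - W @ theta | Z] = 0.

    Assumption: linear models stand in for the paper's nuisance objects.
    Stage I is fitted on K-1 folds and evaluated only on the held-out fold,
    so every sample gets an out-of-fold nuisance prediction.
    """
    n = Z.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)

    W_hat = np.empty_like(W, dtype=float)
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Stage I: estimate the nuisance E[W | Z] without using fold k.
        B = ridge_fit(Z[train], W[train], lam1)
        W_hat[held_out] = Z[held_out] @ B

    # Stage II: empirical average over all samples, each paired with
    # a nuisance prediction that was fitted on the other folds.
    theta = ridge_fit(W_hat, Y, lam2)
    return theta, folds
```

Unlike a single sample split, which spends part of the data only on Stage I, every observation here contributes to Stage II while its nuisance prediction still comes from a model that never saw it.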
Problem

Research questions and friction points this paper is trying to address.

confounded POMDPs
bridge functions
conditional moment restrictions
offline reinforcement learning
hidden confounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-fitting
conditional moment restrictions
bridge functions
model-based reinforcement learning
partially observable MDPs