Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses long-horizon policy value evaluation under domain shift in semiparametric Markov decision processes (MDPs). It extends the adaptive debiased machine learning (ADML) framework to infinite-horizon MDPs for the first time. The method integrates flexible Q-function estimation, Riesz representer learning, isotonic calibration, and fitted Q-iteration, without requiring prior knowledge of the Riesz representer's functional form, enabling model-adaptive, semiparametrically efficient (and in some settings super-efficient) estimation. Compared with conventional approaches, it substantially relaxes the state-distribution overlap requirement, lowers the efficiency bound, and improves estimation accuracy. Crucially, it enables robust long-horizon policy evaluation even from short-trajectory data collected in a new domain. The resulting estimator supports nonparametrically efficient, and in certain settings super-efficient, statistical inference, providing a new tool for long-term causal inference under domain adaptation.
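For orientation, the display below sketches the generic debiased (one-step) estimator that underlies this family of methods; the notation is schematic and not drawn from the paper. For a target functional $\psi = \mathbb{E}[m(O; Q)]$ with Riesz representer $\alpha$ (for policy value, a stationary state density ratio), a sample of transitions $(S_i, A_i, R_i, S'_i)$ yields

$$
\hat{\psi} = \frac{1}{n} \sum_{i=1}^{n} \Big[ m(O_i; \hat{Q}) + \hat{\alpha}(S_i, A_i)\big(R_i + \gamma\, \hat{Q}(S'_i, \pi) - \hat{Q}(S_i, A_i)\big) \Big],
$$

where $\hat{Q}(s', \pi) = \sum_a \pi(a \mid s')\, \hat{Q}(s', a)$. The second term corrects the plug-in $m(O_i; \hat{Q})$ for first-order error in $\hat{Q}$; learning $\hat{\alpha}$ directly from the functional is what makes the procedure "automatic", since the analytic form of $\alpha$ never needs to be derived.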

📝 Abstract
Double reinforcement learning (DRL) enables statistically efficient inference on the value of a policy in a nonparametric Markov Decision Process (MDP) given trajectories generated by another policy. However, this approach necessarily requires stringent overlap between the state distributions, which is often violated in practice. To relax this requirement and extend DRL, we study efficient inference on linear functionals of the $Q$-function (of which policy value is a special case) in infinite-horizon, time-invariant MDPs under semiparametric restrictions on the $Q$-function. These restrictions can reduce the overlap requirement and lower the efficiency bound, yielding more precise estimates. As an important example, we study the evaluation of long-term value under domain adaptation, given a few short trajectories from the new domain and restrictions on the difference between the domains. This can be used for long-term causal inference. Our method combines flexible estimates of the $Q$-function and the Riesz representer of the functional of interest (e.g., the stationary state density ratio for policy value) and is automatic in that we do not need to know the form of the latter, only the functional we care about. To address potential model misspecification bias, we extend the adaptive debiased machine learning (ADML) framework of \citet{van2023adaptive} to construct nonparametrically valid and superefficient estimators that adapt to the functional form of the $Q$-function. As a special case, we propose a novel adaptive debiased plug-in estimator that uses isotonic-calibrated fitted $Q$-iteration (a new calibration algorithm for MDPs) to circumvent the computational challenges of estimating debiasing nuisances from min-max objectives.
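To make the calibration idea concrete, here is a minimal sketch of fitted $Q$-iteration followed by an isotonic calibration step, written with scikit-learn. The interfaces are assumptions for illustration: `transitions` as (state, action, reward, next_state) tuples with discrete actions, and `policy` returning a matrix of action probabilities. The sketch applies calibration once after the final pass for brevity; the paper develops a dedicated calibration algorithm for MDPs, so this shows only the core mechanics, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

def isotonic_calibrated_fqi(transitions, policy, n_actions, gamma=0.99, n_iters=20):
    """Sketch: fitted Q-iteration, then isotonic calibration of the Q-estimates.

    transitions: list of (s, a, r, s_next), with s, s_next as 1-D feature vectors
    policy: callable mapping an array of states to (n, n_actions) action probabilities
    (both interfaces are hypothetical, chosen for this illustration)
    """
    S = np.array([t[0] for t in transitions], dtype=float)
    A = np.array([t[1] for t in transitions], dtype=float)
    R = np.array([t[2] for t in transitions], dtype=float)
    S_next = np.array([t[3] for t in transitions], dtype=float)

    X = np.column_stack([S, A])  # regression features: (state, action)
    q_model, targets = None, R   # first pass regresses the immediate reward
    for _ in range(n_iters):
        q_model = GradientBoostingRegressor().fit(X, targets)
        # Evaluate Q(s', a') for every action, then average under the policy
        # to form the Bellman target for the next pass.
        q_next = np.column_stack([
            q_model.predict(np.column_stack([S_next, np.full(len(S_next), a)]))
            for a in range(n_actions)
        ])
        targets = R + gamma * np.sum(policy(S_next) * q_next, axis=1)

    # Isotonic calibration: learn a monotone remap of raw Q predictions onto
    # the final Bellman targets, so calibrated values track conditional means.
    raw = q_model.predict(X)
    calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw, targets)

    def q_calibrated(states, actions):
        feats = np.column_stack([np.atleast_2d(states), np.ravel(actions)])
        return calibrator.predict(q_model.predict(feats))

    return q_calibrated
```

A calibrated plug-in of this kind is attractive precisely because it avoids the min-max optimization otherwise needed to estimate the debiasing nuisances: the monotone correction is a single one-dimensional regression on top of any off-the-shelf Q-fit.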
Problem

Research questions and friction points this paper is trying to address.

Semiparametric Markov Decision Processes
Double Reinforcement Learning
Policy Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Double Reinforcement Learning
Semiparametric Markov Decision Processes
Long-Term Causal Inference