🤖 AI Summary
This work addresses learning decision policies from offline data in high-stakes settings, where hidden confounders bias policy evaluation and improvement. The authors use instrumental variables, framed as a conditional moment restrictions (CMR) problem, to identify causal effects and, inspired by double/debiased machine learning, derive a sample-efficient CMR solver with convergence and optimality guarantees. They then relax the assumptions on hidden confounders in offline imitation learning and adapt the CMR estimator into an algorithm that learns effective imitator policies with convergence-rate guarantees. Finally, the thesis develops a provably optimal, sample-efficient algorithm for high-level objectives expressed in Linear Temporal Logic (LTL). Empirical evaluations on standard RL benchmarks and on synthetic and semi-synthetic datasets demonstrate improvements in sample efficiency over state-of-the-art methods.
📝 Abstract
The paradigm of decision-making has been revolutionised by reinforcement learning (RL) and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or even adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We use instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restriction (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence-rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and on synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis for real-world decision making.
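As background (this sketch is not from the thesis itself), the combination of instrumental variables with double/debiased machine learning that the abstract refers to can be illustrated in the simplest setting: a partially linear model y = θ·d + g(x) + ε with a hidden confounder affecting both d and y, and an instrument z. Cross-fitting estimates the nuisance regressions E[y|x], E[d|x], E[z|x] on held-out folds and then solves the empirical moment condition on the residuals. The nuisance models below are plain least squares, a stand-in for whatever flexible learners a full method would use:

```python
import numpy as np

def dml_iv_theta(y, d, z, x, n_folds=2, seed=0):
    """Cross-fitted (DML-style) IV estimate of theta in the partially
    linear model y = theta*d + g(x) + eps, using instrument z.

    Solves the empirical moment condition
        sum_i (z_i - z_hat_i) * ((y_i - y_hat_i) - theta * (d_i - d_hat_i)) = 0
    where the hatted quantities are out-of-fold nuisance predictions.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    X = np.column_stack([np.ones(n), x])  # intercept + covariate
    num, den = 0.0, 0.0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])

        def fit_predict(target):
            # nuisance regression fit on training folds only
            beta, *_ = np.linalg.lstsq(X[train], target[train], rcond=None)
            return X[test] @ beta

        ry = y[test] - fit_predict(y)  # residualised outcome
        rd = d[test] - fit_predict(d)  # residualised treatment
        rz = z[test] - fit_predict(z)  # residualised instrument
        num += rz @ ry
        den += rz @ rd
    return num / den
```

On synthetic data with a hidden confounder, a naive regression of y on d is biased, while the IV moment condition recovers θ because the instrument is independent of the confounder.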