Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In offline reinforcement learning, selecting the best value function or dynamics model from a candidate set for accurate off-policy evaluation (OPE) of a target policy remains an open challenge lacking theoretical guarantees. Method: We propose a unified, theoretically grounded OPE selection framework—LSTD-Tournament—with both model-free and model-based selection pathways. It integrates Least-Squares Temporal Difference (LSTD), Fitted Q-Evaluation (FQE), importance sampling, and dynamics fitting within a tournament-style selection mechanism, accompanied by a rigorous bias-variance trade-off analysis. We further introduce a reproducible experimental protocol enabling controlled model misspecification and stable candidate model generation. Results: On Gym benchmarks, LSTD-Tournament reduces average OPE estimation error by 37%, significantly improving selection stability and accuracy, making it an OPE model selection approach with both provable theoretical guarantees and strong empirical performance.

📝 Abstract
Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). In this work we focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. Our contributions are twofold. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation of candidate value functions, better control of misspecification, and evaluation of model-free and model-based methods alike. We exemplify the protocol on a Gym environment, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
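To make the selection problem concrete, here is a minimal, hypothetical sketch of a tournament-style selector over candidate Q-functions. It is not the paper's LSTD-Tournament algorithm: as an illustrative stand-in for the paper's pairwise comparison statistic, it scores each candidate by its empirical squared TD error on offline data (a proxy that is known to be biased under stochastic transitions), and advances the better of each pair in single-elimination rounds. All names (`empirical_bellman_error`, `tournament_select`) are assumptions for this sketch.

```python
import numpy as np

def empirical_bellman_error(q, transitions, policy, gamma=0.99):
    """Mean squared TD error of candidate q on offline (s, a, r, s') data.
    Illustrative proxy only; biased under stochastic dynamics, and the
    paper's selector uses a more careful LSTD-based pairwise statistic."""
    errs = []
    for s, a, r, s_next in transitions:
        a_next = policy(s_next)  # action the target policy would take
        td = q(s, a) - (r + gamma * q(s_next, a_next))
        errs.append(td ** 2)
    return float(np.mean(errs))

def tournament_select(candidates, transitions, policy, gamma=0.99):
    """Single-elimination tournament: compare candidates pairwise,
    advance the one with the smaller score, return the last survivor."""
    pool = list(candidates)
    while len(pool) > 1:
        next_pool = []
        for i in range(0, len(pool) - 1, 2):
            qi, qj = pool[i], pool[i + 1]
            ei = empirical_bellman_error(qi, transitions, policy, gamma)
            ej = empirical_bellman_error(qj, transitions, policy, gamma)
            next_pool.append(qi if ei <= ej else qj)
        if len(pool) % 2 == 1:
            next_pool.append(pool[-1])  # odd candidate gets a bye
        pool = next_pool
    return pool[0]
```

In this toy setup, a candidate that satisfies the Bellman equation on the data beats one that does not; the paper's contribution is a comparison statistic with actual guarantees, which this squared-TD proxy lacks.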
Problem

Research questions and friction points this paper is trying to address.

Hyperparameter tuning for off-policy evaluation
Selection of candidate value functions or dynamics
Development of new experimental evaluation protocol
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-free and model-based selectors
New experimental protocol
Stable generation of candidate value functions
Pai Liu
University of Rochester
AI4Healthcare · Web Agent · LLM
Lingfeng Zhao
University of Illinois Urbana-Champaign
Shivangi Agarwal
Indraprastha Institute of Information Technology Delhi
Jinghan Liu
University of Science and Technology of China
Audrey Huang
University of Illinois Urbana-Champaign
P. Amortila
University of Illinois Urbana-Champaign
Nan Jiang
University of Illinois Urbana-Champaign