🤖 AI Summary
This paper studies contextual best-arm identification (contextual BAI) in adaptive experimental design: under a fixed sampling budget, treatments are allocated dynamically based on covariates so that an individualized policy minimizing worst-case expected regret can be recommended afterward. The authors propose PLAS, an adaptive sampling and policy learning strategy that jointly optimizes treatment allocation and policy estimation, and present it as the first method to establish asymptotic optimality for contextual BAI: the leading factor of its regret upper bound matches the information-theoretic lower bound as the number of experimental units grows. The analysis integrates techniques from best-arm identification, contextual bandits, asymptotic statistics, and regret analysis. By combining statistical efficiency with computational feasibility, PLAS offers a paradigm for evidence-driven, personalized policy design that is rigorously grounded in theory yet practically implementable.
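For concreteness, a standard way to write this criterion (assuming conventional notation, not necessarily the paper's exact definitions): let $\mu^a(x)$ be the conditional mean outcome of arm $a$ given covariates $x$, $a^*(x)$ the best arm at $x$, and $\hat{\pi}_T$ the policy recommended after $T$ experimental units.

```latex
% Expected simple regret of a recommended policy \hat{\pi}_T under distribution P,
% and the worst-case criterion over a class of candidate distributions \mathcal{P}
% (sketch with assumed standard notation, not the paper's exact definitions).
R_P(\hat{\pi}_T)
  = \mathbb{E}_{X \sim P}\!\big[\, \mu^{a^*(X)}(X) - \mu^{\hat{\pi}_T(X)}(X) \,\big],
\qquad
\sup_{P \in \mathcal{P}} \; \mathbb{E}_P\!\big[ R_P(\hat{\pi}_T) \big].
```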
📝 Abstract
Evidence-based targeting has attracted growing interest among practitioners in policy and business. Formulating the decision-maker's policy learning as a fixed-budget best arm identification (BAI) problem with contextual information, we study an optimal adaptive experimental design for policy learning with multiple treatment arms. In the sampling stage, the planner assigns treatment arms adaptively to sequentially arriving experimental units upon observing their contextual information (covariates). After the experiment, the planner recommends an individualized assignment rule to the population. Taking the worst-case expected regret as the performance criterion for the adaptive sampling and recommended policies, we derive its asymptotic lower bounds and propose a strategy, the Adaptive Sampling-Policy Learning strategy (PLAS), whose regret upper bound has a leading factor that matches the lower bound as the number of experimental units increases.
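A minimal, self-contained sketch of the two-stage structure described in the abstract, using a synthetic linear outcome model. The epsilon-greedy-style allocation, the exploration schedule, and the per-arm least-squares plug-in recommendation are placeholder choices for illustration only; they are not the PLAS allocation rule and carry none of its guarantees.

```python
# Sketch of a fixed-budget adaptive experiment with covariates:
# (1) sampling stage assigns arms adaptively after observing covariates,
# (2) recommendation stage outputs an individualized assignment rule.
# All modeling choices below are placeholders, not the PLAS strategy.
import numpy as np

rng = np.random.default_rng(0)
n_arms, dim, budget = 3, 2, 500

# Hypothetical data-generating process: linear conditional means per arm.
true_theta = rng.normal(size=(n_arms, dim))
def draw_unit(arm, x):
    return true_theta[arm] @ x + rng.normal(scale=0.5)

# Per-arm least-squares statistics for estimating mu^a(x) = theta_a' x.
XtX = [np.eye(dim) * 1e-3 for _ in range(n_arms)]   # small ridge for stability
Xty = [np.zeros(dim) for _ in range(n_arms)]

def mu_hat(arm, x):
    return np.linalg.solve(XtX[arm], Xty[arm]) @ x

# Sampling stage: assign arms adaptively over sequentially arriving units.
for t in range(budget):
    x = rng.normal(size=dim)                      # observe covariates
    eps = max(0.1, 1.0 - t / budget)              # placeholder exploration schedule
    if rng.random() < eps:
        arm = int(rng.integers(n_arms))           # explore uniformly
    else:
        arm = int(np.argmax([mu_hat(a, x) for a in range(n_arms)]))
    y = draw_unit(arm, x)                         # observe outcome
    XtX[arm] += np.outer(x, x)
    Xty[arm] += y * x

# Recommendation stage: plug-in individualized policy from the fitted models.
def recommended_policy(x):
    return int(np.argmax([mu_hat(a, x) for a in range(n_arms)]))

x_new = rng.normal(size=dim)
print("recommended arm for a new unit:", recommended_policy(x_new))
```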