🤖 AI Summary
This paper addresses the challenge of on-policy statistical evaluation in contextual multi-armed bandits. The authors propose cram, a single-pass, online on-policy evaluation method that estimates the performance of the final learned policy directly from the data stream generated by the bandit algorithm itself, in contrast to conventional off-policy evaluation paradigms. The framework guarantees consistency and asymptotic normality of the estimator under a stability condition, which is shown to hold for widely used linear contextual bandit algorithms, including ε-greedy, Thompson sampling, and UCB. By using the entire bandit sequence in a single pass under a linear reward model, cram reduces the evaluation standard error by approximately 40% relative to off-policy methods while preserving unbiasedness and nominal confidence interval coverage, and it demonstrates strong empirical performance on both synthetic and real-world datasets.
📝 Abstract
We introduce the cram method as a general statistical framework for evaluating the final learned policy from a contextual multi-armed bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy evaluation methodology differs from most existing methods, which focus on off-policy performance evaluation of contextual bandit algorithms. Cramming utilizes the entire bandit sequence in a single pass over the data, leading to both statistically and computationally efficient evaluation. We prove that if a bandit algorithm satisfies a certain stability condition, the resulting crammed evaluation estimator is consistent and asymptotically normal under mild regularity conditions. Furthermore, we show that this stability condition holds for commonly used linear contextual bandit algorithms, including the epsilon-greedy, Thompson Sampling, and Upper Confidence Bound algorithms. Using both synthetic and publicly available datasets, we compare the empirical performance of cramming with state-of-the-art methods. The results demonstrate that the proposed cram method reduces the evaluation standard error by approximately 40% relative to off-policy evaluation methods while preserving unbiasedness and valid confidence interval coverage.
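To make the setting concrete, the sketch below simulates an epsilon-greedy linear contextual bandit and then evaluates its final learned policy in a single pass over the algorithm's own logged stream. Note the estimator here is a plain inverse-propensity-weighting (IPW) estimate of the final policy's value, used only to illustrate the on-policy evaluation setup; it is not the paper's crammed estimator, which additionally exploits the sequential structure of the bandit run to reduce variance. All problem parameters (`d`, `K`, `T`, `eps`, the true reward vectors) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, eps = 5, 3, 2000, 0.1                 # hypothetical dimensions/horizon
theta_true = rng.normal(size=(K, d))           # hypothetical true reward parameters

# Per-arm ridge-regression statistics for an epsilon-greedy linear bandit.
A = np.stack([np.eye(d)] * K)                  # per-arm Gram matrices
b = np.zeros((K, d))
log = []                                       # logged (context, action, reward, propensity)

for t in range(T):
    x = rng.normal(size=d)
    theta_hat = np.stack([np.linalg.solve(A[k], b[k]) for k in range(K)])
    greedy = int(np.argmax(theta_hat @ x))
    a = greedy if rng.random() > eps else int(rng.integers(K))
    p = (1 - eps) + eps / K if a == greedy else eps / K   # propensity of chosen arm
    r = theta_true[a] @ x + rng.normal(scale=0.1)
    A[a] += np.outer(x, x)
    b[a] += r * x
    log.append((x, a, r, p))

# Single pass over the same stream: IPW estimate of the FINAL policy's value,
# reweighting each logged reward by whether the final policy matches the action.
theta_final = np.stack([np.linalg.solve(A[k], b[k]) for k in range(K)])
scores = np.array([r * (int(np.argmax(theta_final @ x)) == a) / p
                   for (x, a, r, p) in log])
est = scores.mean()
se = scores.std(ddof=1) / np.sqrt(T)
print(f"estimated value of final policy: {est:.3f} +/- {1.96 * se:.3f}")
```

The point of the sketch is the data flow: the same sequence that trains the algorithm is reused, in one pass, to evaluate the policy it ends up with, and a normal-approximation confidence interval follows from the per-observation scores.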