🤖 AI Summary
Existing "warm-start" approaches for bandit problems with historical data suffer from low data efficiency due to spurious data and imbalanced data coverage, problems that are especially severe in continuous action spaces.
Method: We propose ArtificialReplay, a meta-algorithm that incorporates historical data into any base bandit algorithm without modifying that algorithm. We introduce *independence of irrelevant data (IIData)*, a formal property of base bandit algorithms, and design a lightweight replay mechanism that consumes a historical sample only when the base algorithm actually selects the corresponding action, enabling theoretically grounded warm-starting with minimal historical data.
Contribution/Results: ArtificialReplay is agnostic to the underlying bandit algorithm and supports both discrete and continuous action spaces, achieving regret identical to a full warm start while using only a fraction of the historical data. Experiments demonstrate substantial improvements in data efficiency on K-armed and continuous combinatorial bandit tasks, and a green security (anti-poaching) domain modeled with real poaching data validates practical efficacy. Notably, the gains persist even for base algorithms that do not satisfy IIData.
📝 Abstract
Most real-world deployments of bandit algorithms exist somewhere in between the offline and online set-up, where some historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used), particularly for continuous action spaces. To address these challenges, we propose ArtificialReplay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. We show that ArtificialReplay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our results show the practical benefits of ArtificialReplay for improving data efficiency, including for base algorithms that do not satisfy IIData.
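To make the replay idea concrete, here is a minimal sketch of the meta-algorithm for a K-armed setting, assuming a base algorithm that exposes `select_arm()` and `update(arm, reward)` (class and method names are illustrative, not from the paper's code). The key point is that history is consumed lazily: a historical sample is used only when the base algorithm actually proposes its arm, so irrelevant history is never touched.

```python
from collections import defaultdict, deque

class ArtificialReplay:
    """Hedged sketch of the ArtificialReplay meta-algorithm.

    When the base algorithm proposes an arm for which unused historical
    data remains, a stored sample is replayed instead of taking an online
    pull; otherwise the arm is pulled online as usual.
    """

    def __init__(self, base, history):
        self.base = base
        # Group unused historical (arm, reward) pairs by arm.
        self.history = defaultdict(deque)
        for arm, reward in history:
            self.history[arm].append(reward)

    def step(self, pull_online):
        # Ask the (unmodified) base algorithm which arm it wants to pull.
        arm = self.base.select_arm()
        if self.history[arm]:
            # Replay an unused historical sample instead of acting online.
            reward = self.history[arm].popleft()
            self.base.update(arm, reward)
            return None  # no online interaction consumed this step
        # No relevant history left for this arm: pull it online.
        reward = pull_online(arm)
        self.base.update(arm, reward)
        return arm, reward
```

Because historical samples for arms the base algorithm never selects are simply ignored, this sketch illustrates how only a fraction of the history is ever used, which is the data-efficiency property the abstract highlights.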