🤖 AI Summary
This work addresses the parameter mismatch that arises from biased offline data in offline-to-online learning for linear contextual bandits, and proposes the Ellipsoidal-MINUCB algorithm. The method combines a standard online learning branch with an offline-informed pooled branch and uses offline information only when it tightens uncertainty. It replaces the conventional isotropic radius with a geometry-aware ellipsoidal confidence region. Key contributions include a high-probability regret bound that separates statistical width from a certificate-weighted shift penalty, a directional transfer model based on a shift certificate that can also be learned from data at finitely many refresh times, and offline ridge estimation paired with a SupLinUCB-style online fallback. Experiments show the strongest gains at intermediate horizons when offline coverage and transferability align; otherwise the method tracks the safe online baseline.
📝 Abstract
We study offline-to-online learning in linear contextual bandits with biased offline regression data: the offline parameter need not match the online one, so history should not be treated as a single warm start. We model directional transfer with a shift certificate $(M_{\mathrm{shift}},\rho)$ and offline ridge estimation, yielding a geometry-aware confidence region for the online parameter rather than an isotropic radius. We propose \emph{Ellipsoidal-MINUCB}, which combines a standard online branch with an offline-informed pooled branch and uses offline information only when it tightens uncertainty. With high probability, regret is bounded by the minimum of a standard SupLinUCB-style fallback and a pooled term that separates statistical width from a certificate-weighted shift penalty. Under a simple alignment condition, the pooled term further simplifies to a rate governed by an effective dimension induced by the offline geometry. We also show that a purely Euclidean (scalar) shift bound, by itself, does not determine which feature directions are transferable. Beyond this fixed certificate, we show how to learn a certificate from data at finitely many refresh times and establish a high-probability regret bound for Ellipsoidal-MINUCB with epoch-wise learned certificates. Experiments match the main prediction: gains are strongest at intermediate horizons when offline coverage and transferability align, while the method otherwise tracks the safe online baseline.
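The idea of using offline information only when it tightens uncertainty can be illustrated with a small sketch. This is not the paper's algorithm, which is SupLinUCB-style and works with ellipsoidal confidence regions. It only shows a generic "minimum of two upper confidence bounds" rule under stated assumptions: each branch supplies, per arm, a reward estimate and a statistical width, and the pooled branch additionally pays a shift penalty for possible offline bias. The names `BranchEstimate`, `shift_penalty`, and `select_arm` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BranchEstimate:
    """Per-arm summary from one branch (hypothetical structure)."""
    mean: float                 # estimated reward for the arm
    width: float                # statistical confidence width
    shift_penalty: float = 0.0  # bias allowance; assumed zero for the online branch

    def ucb(self) -> float:
        # Upper confidence bound: estimate plus every source of uncertainty.
        return self.mean + self.width + self.shift_penalty


def select_arm(online, pooled):
    """Score each arm by the smaller of its two UCBs, then play the best arm.

    If both bounds hold with high probability, their minimum is also a valid
    upper bound, so the pooled branch is used only where it is tighter.
    """
    scores, branches = [], []
    for on, po in zip(online, pooled):
        if po.ucb() < on.ucb():
            scores.append(po.ucb())
            branches.append("pooled")
        else:
            scores.append(on.ucb())
            branches.append("online")
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores, branches


if __name__ == "__main__":
    # Arm 0: offline data transfers well (small penalty), so pooled is tighter.
    # Arm 1: large shift penalty, so the rule falls back to the online bound.
    online = [BranchEstimate(0.5, 0.4), BranchEstimate(0.3, 0.9)]
    pooled = [BranchEstimate(0.55, 0.1, 0.05), BranchEstimate(0.2, 0.2, 1.0)]
    print(select_arm(online, pooled))
```

In this toy instance the pooled bound is adopted for the first arm and ignored for the second, which mirrors the qualitative behaviour the abstract describes: offline data helps where it transfers, and the method otherwise behaves like the online baseline.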