🤖 AI Summary
This work addresses the challenge of efficient learning in contextual bandits with large action spaces by introducing the OE2D framework, which reduces online decision-making to an offline regression problem. The framework dynamically constructs action distributions via an ``exploitative F-design'' that balances exploration and exploitation. The key innovation is the Decision-Offline Estimation Coefficient (DOEC), a novel complexity measure that, for the first time, establishes a theoretical connection to the Decision-Estimation Coefficient (DEC), thereby unifying design principles for offline- and online-oracle-efficient algorithms. Leveraging tools such as the Eluder dimension, the proposed method achieves near-optimal regret bounds with only $O(\log T)$ calls to an offline regression oracle, or $O(\log \log T)$ calls when the horizon $T$ is known, demonstrating remarkable computational efficiency.
📝 Abstract
We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework achieves near-optimal regret for contextual bandits with large action spaces using $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and only $O(\log\log T)$ calls when $T$ is known. The design of the OE2D algorithm generalizes FALCON~\citep{simchi2022bypassing} and its linear-reward variant~\citep[][Section 4]{xu2020upper}: it chooses an action distribution, which we term the ``exploitative F-design'', that simultaneously guarantees low regret and good coverage, thereby trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in settings with bounded per-context Eluder dimension and in smoothed regret settings. We also establish a relationship between the DOEC and the Decision-Estimation Coefficient (DEC)~\citep{foster2021statistical}, bridging the design principles of offline- and online-oracle-efficient contextual bandit algorithms for the first time.
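To make the oracle-call complexity concrete, below is a minimal, hypothetical Python sketch of the epoch-doubling pattern the abstract describes: the offline regression oracle is refit only at epoch boundaries $2, 4, 8, \dots$, giving $O(\log T)$ oracle calls over $T$ rounds, and actions are sampled from an inverse-gap-weighting distribution as in FALCON. The interfaces `pull`, `offline_regression_oracle`, and the learning-rate schedule `gamma` are illustrative assumptions, not the paper's construction; in particular, OE2D's exploitative F-design generalizes the inverse-gap-weighting rule shown here.

```python
import numpy as np

def igw_distribution(rewards_hat, gamma):
    """Inverse-gap-weighting (IGW) distribution over K actions, as in FALCON.

    Each suboptimal action a gets mass 1 / (K + gamma * gap(a)), where
    gap(a) is the estimated reward gap to the greedy action; the greedy
    action receives all remaining probability mass.
    """
    K = len(rewards_hat)
    best = int(np.argmax(rewards_hat))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            p[a] = 1.0 / (K + gamma * (rewards_hat[best] - rewards_hat[a]))
    p[best] = 1.0 - p.sum()  # valid: every other action has mass at most 1/K
    return p

def epoch_doubling_bandit(T, K, contexts, pull, offline_regression_oracle):
    """Run T rounds, refitting the reward model only when t is a power of
    two, so the offline oracle is called O(log T) times in total.

    `contexts[t-1]` is the round-t context, `pull(t, a)` returns the
    observed reward, and `offline_regression_oracle(data)` returns a fitted
    reward predictor f_hat(x, a); all three are hypothetical interfaces.
    """
    data, f_hat, gamma = [], None, 1.0
    for t in range(1, T + 1):
        if t & (t - 1) == 0:          # epoch boundary: t in {1, 2, 4, 8, ...}
            if data:
                f_hat = offline_regression_oracle(data)
            gamma = np.sqrt(K * t)    # schematic learning-rate schedule
        x = contexts[t - 1]
        if f_hat is None:
            a = np.random.randint(K)  # no model fitted yet: sample uniformly
        else:
            rewards_hat = np.array([f_hat(x, a) for a in range(K)])
            a = int(np.random.choice(K, p=igw_distribution(rewards_hat, gamma)))
        data.append((x, a, pull(t, a)))
    return data
```

When $T$ is known in advance, the boundaries can instead follow a doubly exponential schedule (for instance $\tau_m = T^{1 - 2^{-m}}$, a standard device in batched bandits), which has only $O(\log\log T)$ epochs and is consistent with the known-horizon guarantee stated above.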