Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses efficient learning in contextual bandits with large action spaces and general reward function approximation. It introduces the Offline Estimation to Decisions (OE2D) framework, which reduces contextual bandit learning to offline regression. The algorithm chooses an action distribution, termed an “exploitative F-design”, that simultaneously guarantees low regret and good coverage, trading off exploration and exploitation. The regret analysis centers on a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which is shown to be bounded in the per-context bounded Eluder dimension and smoothed regret settings. The paper also relates DOEC to the Decision-Estimation Coefficient (DEC), connecting the design principles of offline- and online-oracle efficient algorithms for the first time. OE2D attains near-optimal regret with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, or $O(\log \log T)$ calls when the horizon $T$ is known.
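
The oracle-call counts quoted above can be read as counts of epochs. The following is a minimal sketch, assuming one offline-oracle call per epoch, a doubling schedule when $T$ is unknown, and epoch ends of the form $T^{1-2^{-m}}$ when $T$ is known, in the style of epoch-based algorithms such as Falcon. The function names and both schedules are illustrative assumptions and are not taken from the OE2D paper.

```python
def doubling_epoch_ends(T):
    """Assumed schedule for unknown horizon: epochs end at rounds 2, 4, 8, ..., capped at T."""
    ends, m = [], 1
    while True:
        end = min(2 ** m, T)
        ends.append(end)
        if end >= T:
            return ends
        m += 1


def known_horizon_epoch_ends(T):
    """Assumed schedule for known horizon: epochs end at floor(T^(1 - 2^-m)), last one at T."""
    ends, m = [], 1
    # Keep adding epochs while the next boundary is still more than a factor 2 below T.
    while T ** (2.0 ** -m) > 2:
        ends.append(int(T ** (1 - 2.0 ** -m)))
        m += 1
    ends.append(T)  # the final epoch runs to the horizon
    return ends


if __name__ == "__main__":
    T = 10 ** 6
    # One oracle call per epoch (assumption), so the list length is the call count.
    print(len(doubling_epoch_ends(T)))       # 20, roughly log2(T)
    print(len(known_horizon_epoch_ends(T)))  # 5, roughly log2(log2(T))
```

Under these assumptions, a horizon of one million rounds needs about 20 oracle calls with doubling epochs and about 5 when the horizon is known in advance, which matches the $O(\log T)$ versus $O(\log \log T)$ scaling.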

📝 Abstract
We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log \log T)$ calls when $T$ is known. The design of the OE2D algorithm generalizes Falcon~\citep{simchi2022bypassing} and its linear reward version~\citep[][Section 4]{xu2020upper} in that it chooses an action distribution, which we term an ``exploitative F-design'', that simultaneously guarantees low regret and good coverage, trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the per-context bounded Eluder dimension and smoothed regret settings. We also establish a relationship between DOEC and the Decision-Estimation Coefficient (DEC)~\citep{foster2021statistical}, bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
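
For intuition about an action distribution that balances low regret against coverage, the sketch below shows inverse gap weighting, the rule Falcon uses for a finite action set. It is only the special case that the abstract says OE2D generalizes, not the exploitative F-design itself. The function name, the `gamma` parameter name, and the example numbers are illustrative assumptions.

```python
def inverse_gap_weighting(predicted_rewards, gamma):
    """Return action probabilities from predicted rewards (finite action set).

    Each non-greedy action a gets probability 1 / (K + gamma * gap(a)), where
    gap(a) is its predicted shortfall relative to the greedy action; the
    greedy action receives the remaining mass. Larger gamma means less exploration.
    """
    K = len(predicted_rewards)
    best = max(range(K), key=lambda a: predicted_rewards[a])
    probs = [0.0] * K
    for a in range(K):
        if a != best:
            gap = predicted_rewards[best] - predicted_rewards[a]
            probs[a] = 1.0 / (K + gamma * gap)
    # Every non-greedy probability is at most 1/K, so at least 1/K is left over.
    probs[best] = 1.0 - sum(probs)
    return probs


if __name__ == "__main__":
    # Hypothetical predicted rewards for three actions from a regression oracle.
    print(inverse_gap_weighting([0.2, 0.9, 0.5], gamma=10.0))
```

With `gamma = 0` the distribution is uniform (pure exploration), and as `gamma` grows the mass concentrates on the greedy action (pure exploitation).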

Problem

Research questions and friction points this paper is trying to address.

contextual bandits
offline regression
regret minimization
exploration-exploitation tradeoff
reward function approximation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline Regression Oracle
Contextual Bandits
Decision-Offline Estimation Coefficient
Exploitative F-design
Regret Minimization