Active Learning for Stochastic Contextual Linear Bandits

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a source of inefficiency in stochastic contextual linear bandits: existing algorithms choose actions strategically but sample contexts passively from the underlying distribution, which slows the learning of a near-optimal policy. The paper studies active context sampling and introduces an algorithm that uses prior knowledge of the context distribution to select informative context-action pairs for reward observation. Instance-dependent theoretical guarantees show that the method can improve over the minimax rate by up to a factor of √d, where d is the dimension of the linear feature representation. Experiments on tasks including warfarin dose prediction and joke recommendation show that the algorithm reduces the number of samples needed to learn a near-optimal policy.

📝 Abstract

A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but naively (passively) sampling contexts from the underlying context distribution. However, in many practical scenarios -- including online content recommendation, survey research, and clinical trials -- practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for active learning, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove instance-dependent theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\sqrt{d}$, where $d$ is the linear dimension. We show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy, in tasks such as warfarin dose prediction and joke recommendation.
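To make the idea of strategically sampling rewards of context-action pairs concrete, here is a minimal sketch in Python. It is not the paper's algorithm, which this page does not specify. It swaps in a generic greedy uncertainty-sampling heuristic: assume a finite pool of contexts with known features, repeatedly query the context-action pair whose reward is currently most uncertain under a ridge-regression model, and then act greedily on the estimate. The function names (`greedy_active_sampling`, `greedy_policy`), the synthetic features, the noise level, and the ridge parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def greedy_active_sampling(features, reward_fn, n_samples, ridge=1.0):
    """Greedy uncertainty sampling over context-action pairs (illustrative only).

    features:  array of shape (n_contexts, n_actions, d), assumed known in advance
    reward_fn: callable (context_idx, action_idx) -> noisy scalar reward
    Returns the ridge-regression estimate of the reward parameter, shape (d,).
    """
    n_contexts, n_actions, d = features.shape
    flat = features.reshape(-1, d)      # one row per context-action pair
    V_inv = np.eye(d) / ridge           # inverse of the regularized design matrix
    b = np.zeros(d)                     # running sum of reward * feature
    for _ in range(n_samples):
        # Uncertainty phi^T V^{-1} phi of every candidate pair.
        widths = np.einsum("id,de,ie->i", flat, V_inv, flat)
        idx = int(np.argmax(widths))
        phi = flat[idx]
        c, a = divmod(idx, n_actions)   # recover (context, action) indices
        r = reward_fn(c, a)
        # Sherman-Morrison rank-one update of V^{-1} after adding phi phi^T.
        V_phi = V_inv @ phi
        V_inv -= np.outer(V_phi, V_phi) / (1.0 + phi @ V_phi)
        b += r * phi
    return V_inv @ b


def greedy_policy(features, theta_hat):
    """For each context, pick the action with the highest estimated reward."""
    return np.argmax(features @ theta_hat, axis=1)


# Synthetic usage example (all values below are made up for illustration).
rng = np.random.default_rng(0)
d, n_contexts, n_actions = 5, 50, 4
theta_star = rng.normal(size=d)
features = rng.normal(size=(n_contexts, n_actions, d))


def noisy_reward(c, a):
    # Linear reward plus Gaussian noise, the standard linear bandit model.
    return features[c, a] @ theta_star + 0.1 * rng.normal()


theta_hat = greedy_active_sampling(features, noisy_reward, n_samples=200)
policy = greedy_policy(features, theta_hat)
```

A passive baseline would instead draw the context index at random on each round and choose only the action. Comparing the two under the same sample budget is one simple way to see the kind of gap the abstract describes, although this heuristic carries none of the paper's guarantees.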

Problem

Research questions and friction points this paper is trying to address.

active learning

stochastic contextual linear bandits

context sampling

near-optimal policy

sample efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

active learning

contextual linear bandits

strategic context sampling