Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

📅 2025-06-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the Logistic Contextual Slate Bandit problem: in each round, an agent selects an $N$-item slate from an exponentially large candidate set, receives a single binary reward governed by a logistic model, and aims to maximize cumulative reward over $T$ rounds while bounding per-round computational cost. We propose Slate-GLM-OFU and Slate-GLM-TS, novel algorithms that combine local slot-level planning (with $N^{O(1)}$ per-round complexity) and global parameter learning, achieving $\tilde{O}(\sqrt{T})$ regret. Built on generalized linear models, the approach integrates optimistic confidence bounds and Thompson sampling, circumventing the exponential cost of global combinatorial optimization over slates. Extensive experiments on synthetic benchmarks demonstrate significant improvements over state-of-the-art baselines in both cumulative regret and runtime. The methods are further validated on a real-world application, selecting in-context prompt examples for language models on binary classification tasks such as sentiment analysis, achieving competitive accuracy with low computational overhead.

📝 Abstract
We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.
Problem

Research questions and friction points this paper is trying to address.

Develop efficient algorithms for logistic contextual slate bandits
Maximize cumulative reward with low per-round computational costs
Apply algorithms to improve language model example selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local planning for efficient slot selections
Global learning for joint parameter estimation
Logistic model for binary reward prediction
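The local-planning/global-learning split above can be illustrated with a small sketch: each slot's item is chosen independently with an optimistic (UCB-style) score, while a single shared design matrix is updated from the one binary reward observed for the whole slate. This is an illustrative toy, not the paper's Slate-GLM-OFU algorithm: all constants (`alpha`, the dimensions) are hypothetical, and a simple regularized least-squares estimate stands in for the logistic maximum-likelihood estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: N slots, K candidate items per slot, d-dimensional features per item.
N, K, d = 3, 5, 4
theta_star = rng.normal(size=N * d)   # unknown parameter, one block per slot
items = rng.normal(size=(N, K, d))    # candidate item features for each slot

A = np.eye(N * d)                     # regularized design matrix
b = np.zeros(N * d)
alpha = 0.5                           # exploration-bonus width (hypothetical constant)

for t in range(200):
    theta_hat = np.linalg.solve(A, b) # least-squares proxy for the logistic MLE
    A_inv = np.linalg.inv(A)
    slate = []
    # Local planning: each slot is optimized independently, so the per-round
    # cost is polynomial in N rather than exponential in the slate space.
    for s in range(N):
        blk = slice(s * d, (s + 1) * d)
        bonus = np.sqrt(np.einsum("kd,de,ke->k",
                                  items[s], A_inv[blk, blk], items[s]))
        scores = items[s] @ theta_hat[blk] + alpha * bonus
        slate.append(int(np.argmax(scores)))
    # One binary reward for the whole slate, drawn from the logistic model.
    x = np.concatenate([items[s, slate[s]] for s in range(N)])
    reward = float(rng.random() < sigmoid(x @ theta_star))
    # Global learning: a single joint update over the concatenated slate feature.
    A += np.outer(x, x)
    b += reward * x
```

The key point the sketch captures is that the argmax is taken per slot (a loop of size $N \times K$) while the estimator `theta_hat` is fit jointly from the slate-level feedback; Slate-GLM-TS would replace the optimistic bonus with a posterior sample of the parameter.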