Beyond Softmax: A New Perspective on Gradient Bandits

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses a key limitation of the Gradient Bandit algorithm: the softmax policy implicitly treats actions as independent, which prevents it from modeling inter-action correlations. The authors propose a Generalized Gradient Bandit framework that integrates discrete choice theory, in particular the Generalized Nested Logit model, with online learning. The method introduces a generalized gradient update rule and a closed-form sampling mechanism that explicitly capture action dependencies, and it supports both stochastic and adversarial environments. The key contribution is the first incorporation of nested structure into gradient bandit algorithms, relaxing the independence assumptions inherent in softmax-based policies and enabling efficient cooperative learning among correlated actions. The paper establishes a sublinear regret bound, and empirical results on stochastic multi-armed bandit tasks show significantly faster convergence and better decision quality while preserving modeling flexibility and computational efficiency.
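For context, the classical gradient bandit that the paper generalizes keeps one preference per arm, samples actions through a softmax, and performs stochastic gradient ascent on expected reward against a running average baseline. Below is a minimal NumPy sketch of that baseline algorithm; the `pull` callback, step count, and step size are illustrative choices, not the paper's settings.

```python
import numpy as np

def softmax(h):
    """Numerically stable softmax over the preference vector h."""
    z = np.exp(h - h.max())
    return z / z.sum()

def gradient_bandit(pull, n_arms, steps=1000, alpha=0.1, seed=0):
    """Classical softmax gradient bandit; `pull(a)` returns a stochastic reward."""
    rng = np.random.default_rng(seed)
    h = np.zeros(n_arms)            # per-arm preferences
    baseline = 0.0                  # running average reward
    for t in range(1, steps + 1):
        pi = softmax(h)
        a = rng.choice(n_arms, p=pi)
        r = pull(a)
        baseline += (r - baseline) / t
        one_hot = np.eye(n_arms)[a]
        # Stochastic gradient ascent on expected reward under the softmax policy
        h += alpha * (r - baseline) * (one_hot - pi)
    return softmax(h)
```

The softmax here is the multinomial logit model from discrete choice; its independence-of-irrelevant-alternatives property is exactly the restriction the paper relaxes by introducing nested structure over the arms.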

📝 Abstract
We establish a link between a class of discrete choice models and the theory of online learning and multi-armed bandits. Our contributions are: (i) sublinear regret bounds for a broad algorithmic family, encompassing Exp3 as a special case; (ii) a new class of adversarial bandit algorithms derived from generalized nested logit models (Wen and Koppelman, 2001); and (iii) a novel class of generalized gradient bandit algorithms that extends beyond the widely used softmax formulation. By relaxing the restrictive independence assumptions inherent in softmax, our framework accommodates correlated learning dynamics across actions, thereby broadening the applicability of gradient bandit methods. Overall, the proposed algorithms combine flexible model specification with computational efficiency via closed-form sampling probabilities. Numerical experiments in stochastic bandit settings demonstrate their practical effectiveness.
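For reference, the generalized nested logit (GNL) model of Wen and Koppelman (2001) underlying contribution (ii) assigns each alternative a closed-form choice probability. With utilities V_j, allocation parameters α_jm ≥ 0 satisfying Σ_m α_jm = 1, nests B_m, and nest parameters λ_m ∈ (0, 1], the standard formulation reads:

```latex
P(i) = \sum_{m}
  \underbrace{\frac{\Bigl(\sum_{j \in B_m} (\alpha_{jm} e^{V_j})^{1/\lambda_m}\Bigr)^{\lambda_m}}
                   {\sum_{m'} \Bigl(\sum_{j \in B_{m'}} (\alpha_{jm'} e^{V_j})^{1/\lambda_{m'}}\Bigr)^{\lambda_{m'}}}}_{P(\text{nest } B_m)}
  \;
  \underbrace{\frac{(\alpha_{im} e^{V_i})^{1/\lambda_m}}
                   {\sum_{j \in B_m} (\alpha_{jm} e^{V_j})^{1/\lambda_m}}}_{P(i \mid B_m)}
```

The first factor is the probability of selecting nest B_m, the second the conditional probability of alternative i within it. With each alternative in its own singleton nest and λ_m = 1, the expression collapses to the softmax (multinomial logit), which is how the classical gradient bandit sits inside this family.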
Problem

Research questions and friction points this paper is trying to address.

Extends gradient bandits beyond softmax with correlated action dynamics
Establishes link between discrete choice models and online learning theory
Introduces new adversarial bandit algorithms from generalized nested logit models (Exp3, a recovered special case, is sketched below)
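Exp3 is named in the abstract as a special case of the proposed family, so a minimal sketch of it helps anchor the comparison. The `reward_fn` interface below is an illustrative simplification; in the bandit setting the learner observes only the pulled arm's reward each round.

```python
import numpy as np

def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Exp3 (Auer et al., 2002): exponential weights for adversarial bandits.

    reward_fn(t, a) returns the pulled arm's reward in [0, 1].
    """
    rng = np.random.default_rng(seed)
    w = np.ones(n_arms)
    for t in range(horizon):
        p = (1 - gamma) * w / w.sum() + gamma / n_arms  # mix in uniform exploration
        a = rng.choice(n_arms, p=p)
        x_hat = reward_fn(t, a) / p[a]       # importance-weighted reward estimate
        w[a] *= np.exp(gamma * x_hat / n_arms)
    return w / w.sum()
```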
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized gradient bandits beyond softmax formulation
Closed-form sampling probabilities for computational efficiency (see the sketch after this list)
Support for correlated learning dynamics across actions
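A sketch of what such closed-form sampling can look like under the GNL probabilities quoted above; the utilities, allocation matrix, and nest parameters here are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

def gnl_probs(v, alloc, lam):
    """Closed-form generalized nested logit choice probabilities.

    v     : (K,) arm utilities / preferences
    alloc : (M, K) allocation matrix; alloc[m, j] >= 0, columns sum to 1
    lam   : (M,) nest parameters in (0, 1]
    """
    with np.errstate(divide="ignore"):
        # (alpha_{jm} * exp(V_j)) ** (1 / lambda_m), in log space for stability
        log_terms = (np.log(alloc) + v[None, :]) / lam[:, None]
    terms = np.exp(log_terms)               # (M, K); zero where alloc == 0
    nest_sums = terms.sum(axis=1)           # per-nest inclusive value
    p_nest = nest_sums ** lam
    p_nest /= p_nest.sum()                  # P(choose nest m)
    p_within = terms / nest_sums[:, None]   # P(arm i | nest m)
    return p_nest @ p_within                # marginal P(arm i); sums to 1

# Illustrative example: 4 arms, 2 overlapping nests (arm 1 sits in both);
# lam < 1 induces correlation among arms that share a nest.
p = gnl_probs(
    v=np.array([0.5, 0.2, -0.1, 0.0]),
    alloc=np.array([[1.0, 0.5, 0.0, 0.0],
                    [0.0, 0.5, 1.0, 1.0]]),
    lam=np.array([0.5, 0.8]),
)
arm = np.random.default_rng(0).choice(len(p), p=p)  # one closed-form draw
```

No rejection sampling or iterative approximation is needed: every arm's probability is available in closed form, which is the computational-efficiency point the bullet above refers to.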
Emerson Melo
Department of Economics, Indiana University Bloomington
David Müller
Technische Hochschule Nürnberg Georg Simon Ohm