Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime

📅 2025-12-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Classical multi-armed bandit algorithms (e.g., Thompson Sampling, UCB) rely on distributional assumptions or asymptotic guarantees. They neglect finite-sample statistical reliability and suffer suboptimal regret in small-gap regimes, where reward differences between arms are minimal. Method: We introduce conformal prediction (CP) into the bandit framework for the first time, proposing a novel paradigm that jointly ensures statistical coverage guarantees and regret minimization. Our approach achieves nominal coverage (e.g., 90%) with finite-sample validity while attaining optimal $O(\sqrt{T})$ small-gap regret. To enhance temporal adaptability, particularly in financial applications, we integrate a hidden Markov model to capture regime shifts. Contribution/Results: Theoretical analysis, simulations, and empirical portfolio studies demonstrate that our method significantly reduces small-gap regret, strictly maintains pre-specified coverage, and preserves risk-adjusted return efficiency.

📝 Abstract

We introduce Conformal Bandits, a novel framework integrating Conformal Prediction (CP) into bandit problems, a classic paradigm for sequential decision-making under uncertainty. Traditional regret-minimisation bandit strategies like Thompson Sampling and Upper Confidence Bound (UCB) typically rely on distributional assumptions or asymptotic guarantees; further, they remain largely focused on regret, neglecting their statistical properties. We address this gap. Through the adoption of CP, we bridge the regret-minimising potential of a decision-making bandit policy with statistical guarantees in the form of finite-time prediction coverage. We demonstrate the potential of Conformal Bandits through simulation studies and an application to portfolio allocation, a typical small-gap regime, where differences in arm rewards are far too small for classical policies to achieve optimal regret bounds in finite samples. Motivated by this, we showcase our framework's practical advantage in terms of regret in small-gap settings, as well as its added value in achieving nominal coverage guarantees where classical UCB policies fail. Focusing on our application of interest, we further illustrate how integrating hidden Markov models to capture the regime-switching behaviour of financial markets enhances the exploration-exploitation trade-off and translates into higher risk-adjusted regret efficiency returns, while preserving coverage guarantees.
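The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a conformal upper bound could stand in for a UCB index. Everything in it is an assumption made for illustration rather than the authors' method: the function names `conformal_upper_bound` and `run_conformal_bandit`, the use of an arm's sample mean as the point predictor, absolute residuals as nonconformity scores, and the half/half split between fitting and calibration data. Because a bandit collects data adaptively, the exchangeability that split conformal prediction relies on is not automatically satisfied, and this sketch does nothing to repair that.

```python
import math
import random


def conformal_upper_bound(rewards, alpha=0.1):
    """Split-conformal upper endpoint for an arm's next reward (illustrative).

    The first half of the rewards fits a point prediction (the mean);
    the second half supplies absolute-residual calibration scores.
    """
    n = len(rewards)
    if n < 2:
        return math.inf  # too little data: force exploration
    fit, calib = rewards[: n // 2], rewards[n // 2:]
    mean = sum(fit) / len(fit)
    scores = sorted(abs(r - mean) for r in calib)
    m = len(scores)
    # Finite-sample conformal rank: ceil((m + 1) * (1 - alpha)).
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:
        return math.inf  # calibration set too small for this level
    return mean + scores[k - 1]


def run_conformal_bandit(arm_means, horizon=500, alpha=0.1, noise_sd=1.0, seed=0):
    """Pull the arm with the largest conformal upper bound at each round.

    Rewards are simulated as Gaussian (an assumption for this demo).
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    history = [[] for _ in arm_means]
    for _ in range(horizon):
        bounds = [conformal_upper_bound(h, alpha) for h in history]
        arm = max(range(len(arm_means)), key=lambda a: bounds[a])
        history[arm].append(rng.gauss(arm_means[arm], noise_sd))
    return [len(h) for h in history]
```

An infinite bound acts as forced exploration: an arm keeps being selected until its calibration set is large enough for the requested level (at least 9 calibration scores when `alpha=0.1`).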

Problem

Research questions and friction points this paper is trying to address.

Integrates Conformal Prediction into bandit problems for statistical guarantees

Addresses small-gap regimes where classical policies fail to achieve optimal regret

Enhances exploration-exploitation trade-off in portfolio allocation with coverage guarantees
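To make the small-gap difficulty concrete, here is a rough sketch of a textbook UCB1-style policy on simulated Gaussian arms. The function name `ucb1_pseudo_regret`, the Gaussian reward model, and the exploration bonus `sqrt(2 * log(t) / n)` are illustrative assumptions, not details taken from the paper. The point is only that pseudo-regret equals the gap times the number of suboptimal pulls, and that separating two arms with gap Δ under noise level σ takes on the order of σ²/Δ² samples, which can exceed a realistic horizon when Δ is tiny.

```python
import math
import random


def ucb1_pseudo_regret(arm_means, horizon, noise_sd=1.0, seed=0):
    """Run a UCB1-style policy and return (pseudo-regret, pulls per arm)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: pull each arm once
        else:
            # Empirical mean plus exploration bonus for each arm.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        counts[arm] += 1
        sums[arm] += rng.gauss(arm_means[arm], noise_sd)
        regret += best - arm_means[arm]  # gap of the arm just pulled
    return regret, counts
```

Calling `ucb1_pseudo_regret([0.50, 0.51], horizon=2000)` and inspecting the pull counts shows how the policy divides its pulls between two arms whose means differ by only 0.01.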

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Conformal Prediction into bandit problems

Provides finite-time statistical guarantees with coverage

Uses hidden Markov models for regime-switching markets
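For reference, the kind of finite-sample coverage statement that conformal prediction provides can be written as below. This is the standard split-conformal guarantee under exchangeability, stated with assumed notation ($\hat{\mu}$ for a fitted point predictor, $\hat{q}$ for the calibration quantile, $\alpha$ for the miscoverage level); the exact statement the paper proves for the bandit setting may differ.

```latex
\hat{C}_{n}(x) = \bigl[\, \hat{\mu}(x) - \hat{q},\; \hat{\mu}(x) + \hat{q} \,\bigr],
\qquad
\hat{q} = \text{the } \lceil (n+1)(1-\alpha) \rceil\text{-th smallest of }
\bigl\{ \lvert Y_i - \hat{\mu}(X_i) \rvert \bigr\}_{i=1}^{n},

\mathbb{P}\bigl( Y_{n+1} \in \hat{C}_{n}(X_{n+1}) \bigr) \;\ge\; 1 - \alpha .
```

With $\alpha = 0.1$ this corresponds to the 90% nominal coverage mentioned in the summary, and the bound holds for any finite calibration size $n$ large enough that $\lceil (n+1)(1-\alpha) \rceil \le n$.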
