Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether preference data generated by large language models (LLMs) can effectively warm-start contextual bandit algorithms in the presence of noise or systematic bias. Through theoretical regret-bound analysis and extensive experiments across multiple datasets, the study provides the first systematic quantification of the efficacy of LLM priors under varying noise levels and alignment conditions. It establishes provable sufficient conditions under which warm-starting outperforms cold-starting and introduces an alignment-score estimator that predicts when warm-starting is applicable. In well-aligned domains, warm-starting remains effective under up to 30% random label noise; under systematic misalignment, however, it underperforms cold-starting even without added noise. Crucially, the proposed alignment score accurately predicts when LLM-derived priors will be beneficial.
📝 Abstract
The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30% corruption, loses its advantage around 40%, and degrades performance beyond 50%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit's regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
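The abstract's two key ingredients, injecting label-flipping noise into synthetic preference data and estimating alignment as the agreement between LLM and user choices, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `flip_labels` and `alignment_score` are hypothetical helper names, binary preference labels are assumed, and the paper's actual alignment estimator may be more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(prefs, noise_rate, rng):
    """Inject label-flipping noise: each binary preference label is
    flipped independently with probability `noise_rate`."""
    flips = rng.random(len(prefs)) < noise_rate
    return np.where(flips, 1 - prefs, prefs)

def alignment_score(llm_prefs, user_prefs):
    """Empirical alignment proxy: fraction of items on which the
    LLM's choice agrees with the true user choice."""
    return float(np.mean(llm_prefs == user_prefs))

# Ground-truth binary user choices, and synthetic LLM preferences
# corrupted at the 30% noise level studied in the paper.
user = rng.integers(0, 2, size=1000)
llm = flip_labels(user, 0.3, rng)
score = alignment_score(llm, user)  # ~0.70 in expectation
```

Under this toy model, the alignment score degrades linearly with the flip rate, which is why a threshold on estimated alignment can flag when warm-starting is likely to help or hurt.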
Problem

Research questions and friction points this paper addresses.

LLM-initialized bandits, warm-start, preference alignment, label noise, systematic misalignment
Innovation

Methods, ideas, or system contributions that make this work stand out.

LLM-initialized bandits, preference alignment, label noise robustness, theoretical regret analysis, warm-start recommendation