🤖 AI Summary
This paper investigates the optimal integration of structurally heterogeneous expert data in Bayesian multi-armed bandits, in two settings: (i) offline pretraining, where the learner leverages historical data from an expert's optimal policy, and (ii) online collaborative learning, where the learner dynamically chooses at each step whether to update its beliefs from its own experience or from real-time expert feedback. Methodologically, the paper introduces an information-theoretic framework modeling how expert data shapes posterior beliefs, proposes a mutual-information-driven mechanism for dynamic data-source selection, and derives an information-aware regret bound. Theoretically, offline pretraining is shown to tighten the regret upper bound by the mutual information between the expert data and the optimal action. Empirically, the approach adaptively assesses expert reliability and achieves significant improvements in both sample efficiency and robustness over baseline methods.
📝 Abstract
Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes its one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner in cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.
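To make the simultaneous setting concrete, here is a minimal sketch (not the paper's actual algorithm) of an information-directed data-source rule for a two-armed Bernoulli bandit with per-arm Beta posteriors. At each step the learner compares the expected one-step reduction in the entropy of its posterior over the optimal arm `P(A*)` from processing its own candidate pull versus the expert's observation, and updates from whichever source is more informative. The reward means, the assumption that the expert's chosen arm is observable and optimal, and all function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_optimal(alpha, beta, n_samples=4000):
    """Monte Carlo estimate of P(arm i is optimal) under per-arm Beta posteriors."""
    draws = rng.beta(alpha, beta, size=(n_samples, len(alpha)))
    counts = np.bincount(np.argmax(draws, axis=1), minlength=len(alpha))
    return counts / n_samples

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_info_gain(alpha, beta, arm):
    """Expected one-step drop in H(A*) from observing a Bernoulli outcome of `arm`."""
    h0 = entropy(p_optimal(alpha, beta))
    p1 = alpha[arm] / (alpha[arm] + beta[arm])  # posterior predictive P(reward = 1)
    gain = 0.0
    for r, pr in ((1, p1), (0, 1.0 - p1)):
        a, b = alpha.copy(), beta.copy()
        a[arm] += r           # Beta-Bernoulli conjugate update for outcome r
        b[arm] += 1 - r
        gain += pr * (h0 - entropy(p_optimal(a, b)))
    return gain

# --- tiny demo of the simultaneous setting (all numbers illustrative) ---
true_means = np.array([0.3, 0.7])        # hidden Bernoulli reward means
expert_arm = int(np.argmax(true_means))  # assume the expert plays optimally
alpha = np.ones(2)                       # Beta(1, 1) priors on each arm
beta_ = np.ones(2)

for t in range(50):
    # learner's own candidate arm via Thompson sampling
    own_arm = int(np.argmax(rng.beta(alpha, beta_)))
    # information-directed choice: process the more informative data source
    use_expert = (expected_info_gain(alpha, beta_, expert_arm)
                  > expected_info_gain(alpha, beta_, own_arm))
    arm = expert_arm if use_expert else own_arm
    reward = int(rng.random() < true_means[arm])
    alpha[arm] += reward
    beta_[arm] += 1 - reward
```

Because the information gain is computed about the optimal action rather than about reward parameters, the rule naturally stops preferring the expert once the learner's posterior already concentrates on the expert's arm, which mirrors the abstract's point that the value of expert data is exactly its mutual information with the optimal action.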