🤖 AI Summary
Conventional multi-armed bandit algorithms fail under network interference due to their reliance on independence assumptions; existing approaches require fixed network topologies and scale poorly beyond tens of nodes. Method: We propose the first scalable Thompson sampling algorithm that accommodates dynamic network inputs. Under a local-interference causal assumption, we prove that rewards admit a linear decomposition, enabling integration of Bayesian posterior updates with linear reward modeling. Contribution/Results: We establish a Bayesian regret bound of $O(\sqrt{nT})$, guaranteeing sublinear convergence. Empirical evaluation demonstrates efficient learning in networks with up to 100 nodes, with policy rewards significantly outperforming state-of-the-art baselines. Our approach overcomes dual bottlenecks, sample inefficiency and limited adaptability to network dynamics, enabling robust, scalable decision-making in time-varying networked environments.
📝 Abstract
Many interventions, such as vaccines in clinical trials or coupons in online marketplaces, must be assigned sequentially without full knowledge of their effects. Multi-armed bandit algorithms have proven successful in such settings. However, standard independence assumptions fail when the treatment status of one individual impacts the outcomes of others, a phenomenon known as interference. We study optimal-policy learning under interference on a dynamic network. Existing approaches to this problem require repeated observations of the same fixed network and struggle to scale beyond as few as fifteen connected units; both constraints limit practical application. We show that under common assumptions on the structure of interference, rewards become linear. This enables us to develop a scalable Thompson sampling algorithm that maximizes policy impact when a new $n$-node network is observed each round. We prove a Bayesian regret bound that is sublinear in $n$ and the number of rounds. Simulation experiments show that our algorithm learns quickly and outperforms existing methods. The results close a key scalability gap between causal inference methods for interference and practical bandit algorithms, enabling policy optimization in large-scale networked systems.
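The abstract describes combining a linear reward decomposition with Thompson sampling via Bayesian posterior updates. Below is a minimal, illustrative sketch of generic linear Thompson sampling with a conjugate Gaussian posterior; it is not the paper's algorithm, and the feature vectors (standing in for local-interference summaries of candidate allocations), dimensions, and noise variance are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4          # assumed feature dimension (e.g., local exposure summaries)
T = 200        # number of rounds
sigma2 = 0.25  # assumed reward-noise variance
theta_true = rng.normal(size=d)  # unknown linear reward parameter

# Gaussian posterior over theta: N(mu, Sigma), stored via precision form
Sigma_inv = np.eye(d)  # standard-normal prior precision
b = np.zeros(d)        # precision-weighted mean accumulator

for t in range(T):
    # hypothetical action set: each candidate allocation is summarized
    # by a feature vector; here they are drawn at random each round
    actions = rng.normal(size=(8, d))

    # Thompson sampling: draw theta from the posterior, act greedily
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ b
    theta_sample = rng.multivariate_normal(mu, Sigma)
    x = actions[np.argmax(actions @ theta_sample)]

    # observe a noisy linear reward and update the conjugate posterior
    r = x @ theta_true + rng.normal(scale=np.sqrt(sigma2))
    Sigma_inv += np.outer(x, x) / sigma2
    b += x * r / sigma2

mu_final = np.linalg.inv(Sigma_inv) @ b  # posterior mean estimate of theta
```

Because the Gaussian model is conjugate to the linear-Gaussian rewards, each update is a cheap rank-one modification of the precision matrix, which is what makes this style of sampler scale to many rounds.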