🤖 AI Summary
This paper studies the causal Bandit problem under unknown causal graphs: maximizing long-term reward via sequential interventions when both the causal topology and intervention distributions are unknown and potentially non-stationary. We propose a novel online learning framework comprising: (1) an asymmetric graph error control mechanism that separately bounds false positives and false negatives; (2) a joint subgraph-level causal discovery and change detection strategy, substantially reducing sample complexity; and (3) an integrated pipeline combining least-squares weight estimation, problem-specific uncertainty quantification, UCB-style intervention selection, and online abrupt change detection. Evaluated on 100 randomly generated causal Bandit instances, our method reduces average sample complexity by 52% and increases cumulative reward by 85% over state-of-the-art baselines—demonstrating consistent superiority in both stationary and non-stationary environments.
📝 Abstract
In this paper, the causal bandit problem is investigated, with the objective of maximizing the long-term reward by selecting an optimal sequence of interventions on nodes in an unknown causal graph. It is assumed that both the causal topology and the distribution of interventions are unknown. First, based on the difference between the two types of graph identification errors (false positives and negatives), a causal graph learning method is proposed. Numerical results suggest that this method has a much lower sample complexity relative to the prior art by learning sub-graphs. However, we note that a sample complexity analysis for the new algorithm has not been undertaken, as of yet. Under the assumption of minimum-mean squared error weight estimation, a new uncertainty bound tailored to the causal bandit problem is derived. This uncertainty bound drives an upper confidence bound-based intervention selection to optimize the reward. Further, we consider a particular instance of non-stationary bandits wherein both the causal topology and interventional distributions can change. Our solution is the design of a sub-graph change detection mechanism that requires a modest number of samples. Numerical results compare the new methodology to existing schemes and show a substantial performance improvement in stationary and non-stationary settings. Averaged over 100 randomly generated causal bandits, the proposed scheme takes significantly fewer samples to learn the causal structure and achieves a reward gain of 85% compared to existing approaches.