🤖 AI Summary
This paper studies regret minimization in causal bandits where the causal structure is unknown but causal sufficiency is assumed. Whereas existing approaches rely either on parent-node identification or on joint causal graph learning, we first reveal a fundamental tension: accurate identification of the reward variable's parents is inherently incompatible with regret minimization, and full causal graph recovery is unnecessary. To address this, we propose a novel algorithm that bypasses both causal graph reconstruction and parent-set estimation, instead directly exploiting the combinatorial structure of the action space for decision-making. We derive tight information-theoretic regret lower bounds and prove that our algorithm achieves near-optimal regret in both regimes, whether the number of parents is known or unknown. Empirical evaluations across diverse environments demonstrate substantial improvements over state-of-the-art baselines. Our work establishes a new paradigm for causal reinforcement learning that decouples regret minimization from causal discovery.
📝 Abstract
We study regret minimization in causal bandits under causal sufficiency, where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward's parents and then applying classic bandit methods to them, or on jointly learning the parents while minimizing regret. We investigate whether such strategies are optimal. Somewhat counterintuitively, our results show that learning the parent set is suboptimal: we prove that there exist instances where regret minimization and parent identification are fundamentally conflicting objectives. We further analyze both the known and unknown parent set size regimes, and establish novel regret lower bounds that capture the combinatorial structure of the action space. Building on these insights, we propose nearly optimal algorithms that bypass graph and parent recovery, demonstrating that parent identification is indeed unnecessary for regret minimization. Experiments confirm a large performance gap between our method and existing baselines across various environments.
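The paper's algorithm is not spelled out in the abstract, but the setting it studies can be made concrete. The sketch below is an illustrative toy, not the authors' method: a hand-picked two-variable structural causal model (the SCM, its probabilities, and all names are assumptions for the demo), an action space of atomic interventions plus pure observation, and a generic UCB baseline that treats each action as an arm while measuring cumulative regret against the best action's true mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM (illustrative only): binary X1, X2 are the parents of reward Y.
# P(Y=1 | X1, X2), values chosen arbitrarily for the demo.
p_y = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}
P_X = 0.5  # each unintervened X_i ~ Bernoulli(0.5)

# Action space: pure observation plus atomic interventions do(X_i = v).
actions = [("observe", None)] + [(i, v) for i in (1, 2) for v in (0, 1)]

def pull(action):
    """Sample the SCM under the given intervention; return the binary reward."""
    x1, x2 = int(rng.integers(2)), int(rng.integers(2))
    target, v = action
    if target == 1:
        x1 = v
    elif target == 2:
        x2 = v
    return float(rng.random() < p_y[(x1, x2)])

def expected_reward(action):
    """Exact mean reward of an action; used only to measure regret."""
    target, v = action
    total = 0.0
    for x1 in (0, 1):
        for x2 in (0, 1):
            w1 = 1.0 if x1 == v else 0.0 if target == 1 else P_X
            w2 = 1.0 if x2 == v else 0.0 if target == 2 else P_X
            w1 = (1.0 if x1 == v else 0.0) if target == 1 else P_X
            w2 = (1.0 if x2 == v else 0.0) if target == 2 else P_X
            total += w1 * w2 * p_y[(x1, x2)]
    return total

# Generic UCB1 over the action space (a baseline, not the paper's algorithm).
T = 5000
n = np.zeros(len(actions))   # pull counts
s = np.zeros(len(actions))   # reward sums
reward_sum = 0.0
for t in range(T):
    if t < len(actions):
        a = t  # initialize: play each arm once
    else:
        a = int(np.argmax(s / n + np.sqrt(2 * np.log(t + 1) / n)))
    r = pull(actions[a])
    n[a] += 1
    s[a] += r
    reward_sum += r

best = max(expected_reward(a) for a in actions)
regret = best * T - reward_sum
print(f"best mean reward {best:.2f}, cumulative regret {regret:.1f}")
```

In this toy instance the best action is an atomic intervention (here do(X2 = 1), with mean 0.7), and plain UCB over the five arms eventually finds it; the point of the paper is that smarter use of the action space's combinatorial structure, without recovering the graph or the parent set, can do substantially better as the number of variables grows.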