Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

📅 2026-05-28
🤖 AI Summary
This work addresses nested causal structure in multi-level decision-making, where a strategic action causally shapes the context of every later tactical action, a setting that conventional bandit and reinforcement-learning methods handle poorly. The paper introduces Nested Causal Thompson Sampling (NCTS), which samples a mechanism-factorised Bayesian posterior over a hierarchical Structural Causal Model (SCM) and acts recursively under the sampled belief. Its main theoretical result is a causal PAC-Bayes excess-risk bound for nested causal bandits that certifies a candidate policy off-policy and at any time, from historical data alone. A policy switch can therefore be certified without online interaction, and each decision level can be handed over independently. In experiments on a hierarchical SCM, NCTS significantly outperforms a matched joint-regression baseline in zero-shot transfer under exogenous distribution shift and in in-distribution policy performance, and its risk certificate tightens significantly as offline data accumulates, which supports the progressive certified handover mechanism.
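To make the idea of a risk certificate concrete, here is a minimal sketch of a classical PAC-Bayes bound in the McAllester/Maurer form for i.i.d. losses in [0, 1]. It is not the causal, off-policy bound of the paper, whose exact form is not given on this page. The function names `pac_bayes_bound` and `certifies_switch`, and the rule of comparing the bound against an assumed known legacy risk, are hypothetical and serve only as illustration.

```python
import math


def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    """Classical PAC-Bayes upper bound on the true risk (losses in [0, 1]).

    With probability at least 1 - delta over n i.i.d. samples:
        risk <= emp_risk + sqrt((KL(Q||P) + ln(2*sqrt(n)/delta)) / (2n))
    ASSUMPTION: this textbook i.i.d. form stands in for the paper's bound.
    """
    if n < 8:
        raise ValueError("this form of the bound is usually stated for n >= 8")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    # A risk in [0, 1] can never exceed 1, so clip the vacuous case.
    return min(1.0, emp_risk + slack)


def certifies_switch(candidate_bound, legacy_risk):
    """Hypothetical handover rule: switch only if the certified upper bound
    on the candidate's risk is below the (assumed known) legacy risk."""
    return candidate_bound < legacy_risk


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        bound = pac_bayes_bound(emp_risk=0.1, kl=5.0, n=n)
        print(n, round(bound, 4), certifies_switch(bound, legacy_risk=0.3))
```

With the empirical risk and KL term held fixed, the slack term shrinks as `n` grows, which mirrors the reported contraction of the certificate as offline data accumulates.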
📝 Abstract
Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made, yet standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historical data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.
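The "one belief per episode, act recursively" loop described above can be sketched on a toy problem. The example below is an illustrative assumption, not the authors' implementation: it uses a two-level SCM with binary actions, a binary context, Bernoulli mechanisms, and independent Beta posteriors per mechanism. The tables `TRUE_CTX` and `TRUE_REW` and the function `run_nested_ts` are hypothetical names with made-up numbers.

```python
import random

# Hypothetical ground-truth mechanisms of a toy two-level SCM.
TRUE_CTX = [0.2, 0.8]        # P(context = 1 | strategic action a)
TRUE_REW = [[0.3, 0.5],      # P(reward = 1 | context 0, tactical action b)
            [0.4, 0.9]]      # P(reward = 1 | context 1, tactical action b)


def run_nested_ts(episodes=2000, seed=0):
    """Toy nested Thompson sampling; returns (mean reward, strategic counts)."""
    rng = random.Random(seed)
    # Factorised posterior: one Beta(alpha, beta) per mechanism parameter.
    ctx_post = [[1, 1] for _ in range(2)]
    rew_post = [[[1, 1] for _ in range(2)] for _ in range(2)]
    total, counts = 0, [0, 0]
    for _ in range(episodes):
        # Draw ONE belief over all mechanisms for this episode.
        p_ctx = [rng.betavariate(a, b) for a, b in ctx_post]
        p_rew = [[rng.betavariate(a, b) for a, b in row] for row in rew_post]
        # Meta level: value of each strategic action under the sampled belief,
        # assuming the inner level then acts greedily under the same belief.
        best_inner = [max(p_rew[c]) for c in range(2)]
        value = [(1 - p_ctx[a]) * best_inner[0] + p_ctx[a] * best_inner[1]
                 for a in range(2)]
        a = max(range(2), key=value.__getitem__)
        # The strategic action sets the context distribution.
        c = int(rng.random() < TRUE_CTX[a])
        # Inner level: tactical action under the same sampled belief.
        b = max(range(2), key=p_rew[c].__getitem__)
        r = int(rng.random() < TRUE_REW[c][b])
        # Update only the mechanisms that were actually exercised.
        ctx_post[a][0] += c
        ctx_post[a][1] += 1 - c
        rew_post[c][b][0] += r
        rew_post[c][b][1] += 1 - r
        total += r
        counts[a] += 1
    return total / episodes, counts


if __name__ == "__main__":
    mean_reward, counts = run_nested_ts()
    print(mean_reward, counts)
```

In this toy setup the better strategic action is `a = 1`, with an expected reward of 0.2 × 0.5 + 0.8 × 0.9 = 0.82 against 0.58 for `a = 0`, so the sampler should concentrate on it as the posteriors sharpen. Because each mechanism has its own posterior, data gathered about the reward mechanism is reused regardless of which strategic action produced the context.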
Problem

Research questions and friction points this paper is trying to address.

Nested Causal Bandits
Causal Reinforcement Learning
PAC-Bayes
Safe Policy Deployment
Hierarchical Decision Making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nested Causal Bandits
PAC-Bayes Risk Certification
Mechanism-Factorised Posterior
Off-Policy Policy Evaluation
Progressive Certified Handover