🤖 AI Summary
This work addresses the lack of benchmarks for causal reinforcement learning in complex systems that combine sequential decision making, partial observability, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering that combines five competitive Standard archetypes, three reward schemes, and a hand-specified structural causal model (SCM) over strategic variables. As a reference causal agent, we propose Causal Graph-Factored Advantage PPO (CGFA-PPO), which uses the SCM parents of win probability as factor-aligned critic targets together with an intervention-calibration loss. The benchmark treats causal credit assignment, leave-one-out cross-archetype transfer, and policy auditability as first-class metrics. In experiments, masked PPO and CGFA-PPO reach competitive in-distribution win rates and exceed the random baseline, while per-factor calibration trajectories and leave-one-out transfer gaps expose diagnostic structure that scalar win rate alone cannot.
📝 Abstract
Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077-dimensional partial observation, a 478-action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand-specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM-predicted intervention effects, and per-factor credit traces, making causal credit assignment, leave-one-out cross-archetype transfer, and policy auditability first-class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal-world-model PPO variant, and an architecture-matched scalar control. We propose Causal Graph-Factored Advantage PPO (CGFA-PPO) as a reference causal agent that uses SCM parents of win probability as factor-aligned critic targets with an intervention-calibration loss. All comparisons use paired seeds, paired-bootstrap confidence intervals, and Holm-Bonferroni correction within pre-registered families. Masked PPO and CGFA-PPO reach competitive in-distribution win rates and exceed the random baseline; per-factor calibration trajectories and leave-one-out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference-baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG-Causal-RL gives causal-RL, world-model, and LLM-agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM-grounded policy auditability.
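The statistical protocol named in the abstract (paired seeds, paired-bootstrap confidence intervals, Holm-Bonferroni correction) can be illustrated with a minimal Python sketch. It is not the benchmark's released evaluation code: the function names `paired_bootstrap_ci` and `holm_bonferroni`, the percentile-bootstrap choice, the resample count, and all win rates and p-values below are assumptions made for illustration only.

```python
import random
from statistics import mean


def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a - b), resampling seed-matched pairs.

    a[i] and b[i] are assumed to come from the same training seed, so the
    per-seed differences are resampled rather than the two groups separately.
    """
    if len(a) != len(b):
        raise ValueError("a and b must be paired (equal length)")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and sort the means.
    boots = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return mean(diffs), lo, hi


def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down rule: one reject/keep decision per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject


# Made-up per-seed win rates for two hypothetical agents (not paper results).
agent_a = [0.60, 0.62, 0.58, 0.65, 0.61]
agent_b = [0.50, 0.55, 0.49, 0.52, 0.53]
point, lo, hi = paired_bootstrap_ci(agent_a, agent_b)
print(f"mean paired difference {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")

# Made-up p-values for one family of three comparisons.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

Resampling the per-seed differences keeps the pairing intact, which is the point of evaluating all agents on the same seeds. Applying the Holm-Bonferroni step separately within each family of comparisons matches the "within pre-registered families" wording above.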