🤖 AI Summary
This study addresses a key weakness of existing time-series causal discovery methods: when core assumptions such as stationarity or regular sampling are violated, these methods can produce confident yet erroneous causal graphs, and they lack mechanisms to flag that unreliability. To remedy this, the work introduces Causal-Audit, a framework that formalizes assumption validation as calibrated risk assessment. It computes effect-size diagnostics across five assumption categories, aggregates them into four calibrated risk scores with uncertainty intervals, and applies an abstention-aware decision policy that refrains from inference, or recommends alternative methods, when reliability is low. Experiments demonstrate well-calibrated risk scores (AUROC > 0.95 on a synthetic atlas of 500 data-generating processes), a 62% reduction in false positives among recommended datasets, and 78% abstention on severely violated cases. Across 21 external evaluations from the TimeGraph and CausalTime benchmarks, the framework's recommend-or-abstain decisions match the benchmark specifications in every case.
📝 Abstract
Time-series causal discovery methods rely on assumptions such as stationarity, regular sampling, and bounded temporal dependence. When these assumptions are violated, structure learning can produce confident but misleading causal graphs without warning. We introduce Causal-Audit, a framework that formalizes assumption validation as calibrated risk assessment. The framework computes effect-size diagnostics across five assumption families (stationarity, irregularity, persistence, nonlinearity, and confounding proxies), aggregates them into four calibrated risk scores with uncertainty intervals, and applies an abstention-aware decision policy that recommends methods (e.g., PCMCI+, VAR-based Granger causality) only when evidence supports reliable inference. The semi-automatic diagnostic stage can also be used independently for structured assumption auditing in individual studies. Evaluation on a synthetic atlas of 500 data-generating processes (DGPs) spanning 10 violation families demonstrates well-calibrated risk scores (AUROC > 0.95), a 62% false positive reduction among recommended datasets, and 78% abstention on severe-violation cases. On 21 external evaluations from TimeGraph (18 categories) and CausalTime (3 domains), recommend-or-abstain decisions are consistent with benchmark specifications in all cases. An open-source implementation of our framework is available.
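To make the diagnose-then-decide pipeline the abstract describes more concrete, here is a minimal sketch of one assumption check feeding an abstention-aware policy. The drift statistic, function names, and thresholds are illustrative assumptions, not the paper's actual diagnostics or calibration procedure:

```python
import random
import statistics

def stationarity_risk(series, n_windows=4):
    """Toy stationarity diagnostic: mean drift across windows,
    normalized by the overall scale to act as an effect size."""
    w = len(series) // n_windows
    means = [statistics.mean(series[i * w:(i + 1) * w]) for i in range(n_windows)]
    drift = max(means) - min(means)
    scale = statistics.stdev(series) or 1.0
    return min(drift / (2.0 * scale), 1.0)  # squash into [0, 1]

def decide(risk, tau_recommend=0.3, tau_abstain=0.7):
    """Abstention-aware policy: recommend inference only at low risk,
    flag caution at moderate risk, abstain outright at high risk."""
    if risk < tau_recommend:
        return "recommend"
    if risk < tau_abstain:
        return "caution"
    return "abstain"

random.seed(0)
stable = [random.gauss(0, 1) for _ in range(400)]             # roughly stationary
drifting = [t / 50 + random.gauss(0, 1) for t in range(400)]  # strong trend

print(decide(stationarity_risk(stable)))    # low drift
print(decide(stationarity_risk(drifting)))  # trend dominates
```

In the actual framework, several such diagnostics per assumption family would be calibrated against labeled violation cases and aggregated; this sketch only shows the shape of the recommend-or-abstain decision.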