Scalable Causal Discovery from Recursive Nonlinear Data via Truncated Basis Function Scores and Tests

2025-10-05
AI Summary
Causal graph learning from nonlinear, continuous, or mixed data scales poorly to hundreds of variables and thousands of samples. This paper introduces two tools built on truncated basis function expansions: the BF-BIC score and the BF-LRT conditional independence test. BF-BIC is consistent under additive models and extends to post-nonlinear (PNL) models through an invertible reparameterization. Discrete variables are handled through a degenerate-Gaussian embedding, which supports mixed data, and the tools can be integrated into hybrid searches. In simulations with nonlinear neural causal models, BF-BIC outperforms kernel- and constraint-based methods such as KCI and RFCI in both accuracy and runtime, and BF-LRT is substantially faster than kernel tests while retaining competitive accuracy. An application to Canadian wildfire risk illustrates the approach on real data.

Abstract
Learning graphical conditional independence structures from nonlinear, continuous or mixed data is a central challenge in machine learning and the sciences, and many existing methods struggle to scale to thousands of samples or hundreds of variables. We introduce two basis-expansion tools for scalable causal discovery. First, the Basis Function BIC (BF-BIC) score uses truncated additive expansions to approximate nonlinear dependencies. BF-BIC is theoretically consistent under additive models and extends to post-nonlinear (PNL) models via an invertible reparameterization. It remains robust under moderate interactions and supports mixed data through a degenerate-Gaussian embedding for discrete variables. In simulations with fully nonlinear neural causal models (NCMs), BF-BIC outperforms kernel- and constraint-based methods (e.g., KCI, RFCI) in both accuracy and runtime. Second, the Basis Function Likelihood Ratio Test (BF-LRT) provides an approximate conditional independence test that is substantially faster than kernel tests while retaining competitive accuracy. Extensive simulations and a real-data application to Canadian wildfire risk show that, when integrated into hybrid searches, BF-based methods enable interpretable and scalable causal discovery. Implementations are available in Python, R, and Java.
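
The idea behind a basis-function BIC score can be illustrated with a minimal sketch. This is not the authors' implementation: the polynomial basis, the truncation degree, the Gaussian likelihood, and the names `basis_expand` and `bf_bic_local` are all assumptions made for illustration. Each parent is expanded into a few basis terms, the child is regressed on the resulting additive design, and the fit is penalized by the number of coefficients.

```python
import numpy as np

def basis_expand(x, degree):
    """Truncated polynomial basis of a standardized column: z, z^2, ..., z^degree.

    The choice of a polynomial basis is an assumption for illustration.
    """
    z = (x - x.mean()) / x.std()
    return np.column_stack([z ** k for k in range(1, degree + 1)])

def bf_bic_local(y, parents, degree=3):
    """BIC-style local score of y given a list of parent columns (higher is better).

    Hypothetical helper: additive design (no interaction terms), Gaussian
    residuals, and a k * log(n) penalty on the number of coefficients.
    """
    n = len(y)
    cols = [np.ones(n)]                      # intercept
    for p in parents:
        cols.append(basis_expand(p, degree)) # one truncated expansion per parent
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n               # maximum-likelihood residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * loglik - X.shape[1] * np.log(n)

# Toy data: y depends on x only through x**2, so their linear correlation is near zero.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x ** 2 + 0.5 * rng.normal(size=2000)
score_without_parent = bf_bic_local(y, [])
score_with_parent = bf_bic_local(y, [x])
```

In this toy setting the score with `x` as a parent should be the larger of the two, because the quadratic basis term captures a dependence that a purely linear score would miss.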

Problem

Research questions and friction points this paper is trying to address.

Scaling causal discovery on nonlinear, continuous or mixed data to thousands of samples or hundreds of variables
Learning graphical conditional independence structures from continuous or mixed variables
Developing efficient methods that outperform kernel-based approaches in accuracy and runtime

Innovation

Methods, ideas, or system contributions that make the work stand out.

The BF-BIC score uses truncated additive basis expansions to approximate nonlinear dependencies
BF-LRT provides a fast, approximate conditional independence test
Methods support mixed data via a degenerate-Gaussian embedding of discrete variables
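
A likelihood ratio test over basis expansions, combined with an indicator embedding for a discrete variable, might look like the sketch below. It is an illustration under stated assumptions, not the paper's procedure: the helper names (`poly_basis`, `embed_discrete`, `bf_lrt_pvalue`), the polynomial basis, the choice to drop the first level of the discrete variable, and the chi-square reference distribution are all hypothetical. The test compares a regression of y on the conditioning features against one that also includes the features of x.

```python
import numpy as np
from scipy.stats import chi2

def poly_basis(x, degree=3):
    """Truncated polynomial basis of a standardized continuous column (assumed basis)."""
    z = (x - x.mean()) / x.std()
    return np.column_stack([z ** k for k in range(1, degree + 1)])

def embed_discrete(codes):
    """Indicator columns for an integer-coded variable, dropping the first level.

    Treating these columns like continuous features is one reading of a
    degenerate-Gaussian embedding; the details here are an assumption.
    """
    levels = np.unique(codes)
    if len(levels) < 2:
        return np.empty((len(codes), 0))
    return np.column_stack([(codes == lv).astype(float) for lv in levels[1:]])

def rss(design, y):
    """Residual sum of squares of a least-squares fit of y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def bf_lrt_pvalue(x_feats, y, z_feats):
    """Approximate p-value for 'y independent of x given z' via nested Gaussian regressions."""
    n = len(y)
    null = np.column_stack([np.ones(n), z_feats])  # intercept + conditioning features
    full = np.column_stack([null, x_feats])        # null model plus the features of x
    stat = n * np.log(rss(null, y) / rss(full, y)) # Gaussian likelihood ratio statistic
    df = full.shape[1] - null.shape[1]             # number of added coefficients
    return float(chi2.sf(stat, df))

# Toy data: a three-level discrete variable c drives both x and y.
rng = np.random.default_rng(0)
n = 2000
c = rng.integers(0, 3, size=n)
x = c + 0.3 * rng.normal(size=n)
y = (c - 1.0) ** 2 + 0.5 * rng.normal(size=n)
p_marginal = bf_lrt_pvalue(poly_basis(x), y, np.empty((n, 0)))
p_conditional = bf_lrt_pvalue(poly_basis(x), y, embed_discrete(c))
```

In this toy example x and y are dependent marginally but independent given c, so `p_marginal` should be very small while `p_conditional` should behave like a draw from a roughly uniform distribution. Both fits are ordinary least-squares problems, which is where the speed advantage over kernel tests would come from in a design like this.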