🤖 AI Summary
Traditional causal discovery methods become computationally intractable in high-dimensional settings (up to 10⁴ variables) because the search space of directed acyclic graphs grows super-exponentially with the number of variables.
Method: This paper proposes a superstructure-guided causal graph partitioning framework that decomposes the global causal search into parallelizable subproblems. This enables divide-and-conquer causal discovery that, under certain assumptions, provably recovers the Markov equivalence class of the true causal graph.
Contribution/Results: The paper defines a novel causal graph partition that supports divide-and-conquer causal discovery with theoretical guarantees. On biologically tuned synthetic networks, the method achieves accuracy comparable to existing approaches with a faster time to solution, and it scales to networks of up to 10⁴ variables. This makes it applicable to gene regulatory network inference and other high-dimensional domains.
📝 Abstract
The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way -- without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure -- a set of learned or existing candidate hypotheses -- to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov equivalence class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically tuned synthetic networks and networks of up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
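To make the divide-and-conquer idea concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes the superstructure is given as an undirected adjacency map, that a disjoint partition of the variables is already available, and that each part is expanded with its superstructure neighbours so that parts overlap. The names `expand_parts`, `local_edges`, and `merge` are hypothetical, and `local_edges` is only a stand-in for a real causal discovery routine run on each subset: it returns the candidate edges that fall inside the subset rather than learning anything from data.

```python
def expand_parts(superstructure, disjoint_parts):
    """Grow each disjoint part with its superstructure neighbours (assumed rule)."""
    expanded = []
    for part in disjoint_parts:
        grown = set(part)
        for node in part:
            grown |= superstructure[node]
        expanded.append(grown)
    return expanded


def local_edges(nodes, superstructure):
    """Stand-in for a local learner: candidate edges with both ends in `nodes`."""
    return {
        frozenset((u, v))
        for u in nodes
        for v in superstructure[u]
        if v in nodes
    }


def merge(edge_sets):
    """Combine the per-part results by taking the union of their edges."""
    return set().union(*edge_sets)


# Toy superstructure: an undirected chain a - b - c - d - e - f.
chain = ["a", "b", "c", "d", "e", "f"]
superstructure = {v: set() for v in chain}
for u, v in zip(chain, chain[1:]):
    superstructure[u].add(v)
    superstructure[v].add(u)

disjoint_parts = [{"a", "b", "c"}, {"d", "e", "f"}]
overlapping_parts = expand_parts(superstructure, disjoint_parts)

all_edges = local_edges(set(chain), superstructure)

# Solving on disjoint parts loses every candidate edge that crosses a boundary.
naive = merge(local_edges(p, superstructure) for p in disjoint_parts)

# Solving on overlapping parts keeps those edges visible to some subproblem.
recovered = merge(local_edges(p, superstructure) for p in overlapping_parts)

print("edges lost by the disjoint split:", [sorted(e) for e in all_edges - naive])
print("all candidate edges covered:", recovered == all_edges)
```

In this toy example the disjoint split never sees the candidate edge between `c` and `d`, while the overlapping parts each contain both endpoints. A real pipeline would replace `local_edges` with a causal discovery algorithm applied to the data columns of each subset, and would need a principled rule for reconciling conflicting results in the overlaps, which this sketch does not attempt.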