🤖 AI Summary
Causal discovery in small-sample, high-dimensional biological data, such as single-cell transcriptomics (thousands of variables, tens of cells per intervention), is fragile under model misspecification and distributional shift.
Method: We propose the first foundation model for causal discovery, pretrained on synthetic data to predict global causal graphs from local causal cues and subgraph-level statistics (e.g., inverse covariance). The architecture integrates supervised learning, stochastic subgraph sampling, statistical feature extraction, and graph-level aggregation.
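The sample-and-featurize step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the subset size, number of subsets, and ridge regularizer are all illustrative choices, and the precision matrix stands in for the richer subgraph-level statistics the model consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n cells x d genes (sizes are illustrative, not from the paper).
n, d = 50, 8
X = rng.standard_normal((n, d))

def subgraph_features(X, subset):
    """One statistical feature over a sampled variable subset: the
    inverse covariance (precision) matrix of those columns."""
    sub = X[:, subset]
    cov = np.cov(sub, rowvar=False)
    # Small ridge term keeps the inverse stable when samples are few.
    return np.linalg.inv(cov + 1e-3 * np.eye(len(subset)))

# Stochastic subgraph sampling: draw random variable subsets, then
# extract per-subset statistics to feed the supervised model.
subsets = [rng.choice(d, size=4, replace=False) for _ in range(10)]
feats = [subgraph_features(X, s) for s in subsets]
print(feats[0].shape)  # one 4x4 precision matrix per sampled subset
```

In the full pipeline, these per-subset features would be paired with the outputs of a classical discovery algorithm run on the same subsets before being aggregated into a global graph prediction.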
Contribution/Results: We theoretically prove consistent aggregation of subgraph-level causal structures. The model is robust to model misspecification and distributional shift, infers causal graphs over hundreds of nodes in seconds, and generalizes well across datasets, significantly outperforming conventional algorithms on both biological and synthetic benchmarks.
📝 Abstract
Causal discovery, the task of inferring causal structure from data, has the potential to uncover mechanistic insights from biological experiments, especially those involving perturbations. However, causal discovery algorithms over larger sets of variables tend to be brittle under misspecification or limited data. For example, single-cell transcriptomics measures thousands of genes, but the nature of their relationships is not known, and there may be as few as tens of cells per intervention setting. To mitigate these challenges, we propose a foundation model-inspired approach: a supervised model trained on large-scale, synthetic data to predict causal graphs from summary statistics, such as the outputs of classical causal discovery algorithms run over subsets of variables, along with other statistical hints like the inverse covariance. Our approach is enabled by the observation that typical errors in the outputs of a discovery algorithm remain comparable across datasets. Theoretically, we show that the model architecture is well-specified, in the sense that it can recover a causal graph consistent with the graphs over subsets. Empirically, we train the model on diverse datasets to be robust to misspecification and distribution shift. Experiments on biological and synthetic data confirm that the model generalizes well beyond its training set, runs on graphs with hundreds of variables in seconds, and can be easily adapted to different underlying data assumptions.
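The aggregation idea in the abstract, recovering a global graph consistent with graphs estimated over subsets, can be illustrated with a simple voting scheme. This is a hedged sketch of the general principle, not the paper's learned aggregator: edge scores are averaged over the subsets containing both endpoints and thresholded, where the real model aggregates with a trained network.

```python
import numpy as np

def aggregate_subgraphs(d, estimates, threshold=0.5):
    """Combine local adjacency estimates into a global d x d graph.

    `estimates` is a list of (subset, adj) pairs, where `subset` maps the
    local indices of `adj` to global variable indices. Each directed edge
    is scored by its average vote over the subsets that observe both
    endpoints, then thresholded.
    """
    votes = np.zeros((d, d))
    counts = np.zeros((d, d))
    for subset, adj in estimates:
        for a, i in enumerate(subset):
            for b, j in enumerate(subset):
                votes[i, j] += adj[a, b]
                counts[i, j] += 1
    scores = np.divide(votes, counts,
                       out=np.zeros_like(votes), where=counts > 0)
    return (scores > threshold).astype(int)

# Two overlapping subset estimates that agree on the edge 0 -> 1.
est1 = ([0, 1, 2], np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]]))
est2 = ([0, 1, 3], np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]]))
G = aggregate_subgraphs(4, [est1, est2])
print(G[0, 1])  # 1: both local estimates vote for the edge
```

Because every edge is scored only on subsets that contain both of its endpoints, agreement between overlapping local estimates translates directly into a consistent global graph, which is the intuition behind the consistency result stated in the abstract.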