🤖 AI Summary
Problem: In a patient affected by a monogenic disorder, identifying the disease-causing gene (the root cause variable from which the perturbation propagates to downstream genes) is challenging when the causal ordering of the variables is unknown.
Method: We address this root cause discovery problem under a linear structural equation model (SEM) with unknown causal ordering, using a single interventional sample together with a set of observational samples as reference. The approach combines permutations of the variables with the Cholesky decomposition and exploits a property that stays invariant across permutations. Characterizing the permutations that yield the correct root cause leads to a valid discovery method, which is also adapted to high-dimensional settings.
Contribution/Results: We show that a simple baseline based on squared z-scores generally cannot identify the root cause, and we characterize when it succeeds and fails. We then prove that, within the linear SEM and without additional assumptions (e.g., non-Gaussian noise or sparsity), the root cause is identifiable even when the causal ordering is not. Simulations are used to evaluate the performance of the proposed methods. On real gene expression data, the high-dimensional method recovers multiple known disease-causing genes.
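A short derivation may help show why the Cholesky decomposition is a natural tool here. It is a sketch under assumptions that the summary does not state: the intervention is modeled as a mean shift $\delta$ with a single nonzero entry at the root cause $r$, the variables are centered, and they are written in a causal ordering so that $B$ is strictly lower triangular. The symbols $B$, $D$, $\varepsilon$, $\delta$ and $L$ are illustrative notation.

$$
X = BX + \varepsilon, \qquad \mathrm{Cov}(\varepsilon) = D \ \text{(diagonal)}, \qquad X^{I} = BX^{I} + \varepsilon + \delta
$$

$$
\Sigma = \mathrm{Cov}(X) = (I-B)^{-1} D\, (I-B)^{-\top} = LL^{\top}, \qquad L = (I-B)^{-1} D^{1/2}
$$

$$
L^{-1} X^{I} = D^{-1/2}(I-B)\,X^{I} = D^{-1/2}(\varepsilon + \delta)
$$

Because $L$ is lower triangular with a positive diagonal, it is the Cholesky factor of $\Sigma$. In this ordering the whitened interventional sample has exactly one entry with a shifted mean, at position $r$. When the ordering is unknown, the variables have to be permuted before factorizing, which is where the permutations come in.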
📝 Abstract
This work is motivated by the following problem: Can we identify the disease-causing gene in a patient affected by a monogenic disorder? This problem is an instance of root cause discovery. In particular, we aim to identify the intervened variable in one interventional sample using a set of observational samples as reference. We consider a linear structural equation model where the causal ordering is unknown. We begin by examining a simple method that uses squared z-scores and characterize the conditions under which this method succeeds and fails, showing that it generally cannot identify the root cause. We then prove, without additional assumptions, that the root cause is identifiable even if the causal ordering is not. Two key ingredients of this identifiability result are the use of permutations and the Cholesky decomposition, which allow us to exploit an invariant property across different permutations to discover the root cause. Furthermore, we characterize permutations that yield the correct root cause and, based on this, propose a valid method for root cause discovery. We also adapt this approach to high-dimensional settings. Finally, we evaluate the performance of our methods through simulations and apply the high-dimensional method to discover disease-causing genes in the gene expression dataset that motivates this work.
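The contrast between squared z-scores and a Cholesky-based score can be illustrated with a small simulation. This is a minimal sketch, not the authors' method: it assumes a mean-shift intervention, Gaussian noise, and a known causal ordering (the columns are already sorted), whereas the paper handles an unknown ordering through permutations. The function names `simulate`, `squared_z_scores` and `cholesky_scores`, and all parameter values, are hypothetical.

```python
import numpy as np


def simulate(p=5, n=2000, root=1, delta=10.0, seed=0):
    """Draw n observational samples and one interventional sample from a linear SEM."""
    rng = np.random.default_rng(seed)
    # Strictly lower-triangular B: columns are already in a causal ordering (assumption).
    B = np.tril(rng.uniform(0.5, 1.5, size=(p, p)), k=-1)
    A = np.linalg.inv(np.eye(p) - B)      # reduced form: X = A @ eps
    sd = rng.uniform(0.8, 1.2, size=p)    # noise standard deviations
    X_obs = (rng.normal(size=(n, p)) * sd) @ A.T
    shift = np.zeros(p)
    shift[root] = delta                   # mean shift applied to the root cause only
    x_int = A @ (rng.normal(size=p) * sd + shift)
    return X_obs, x_int


def squared_z_scores(X_obs, x_int):
    """Marginal baseline: standardize each variable separately, then square."""
    mu = X_obs.mean(axis=0)
    sigma = X_obs.std(axis=0, ddof=1)
    return ((x_int - mu) / sigma) ** 2


def cholesky_scores(X_obs, x_int):
    """Whiten with the Cholesky factor of the sample covariance, then square."""
    mu = X_obs.mean(axis=0)
    Sigma = np.cov(X_obs, rowvar=False)
    L = np.linalg.cholesky(Sigma)         # lower triangular, Sigma = L @ L.T
    xi = np.linalg.solve(L, x_int - mu)   # xi = L^{-1} (x_int - mu)
    return xi ** 2


X_obs, x_int = simulate()
print("squared z-scores:", np.round(squared_z_scores(X_obs, x_int), 1))
print("Cholesky scores: ", np.round(cholesky_scores(X_obs, x_int), 1))
print("argmax of Cholesky scores:", int(np.argmax(cholesky_scores(X_obs, x_int))))
```

In this setup the shift propagates to the descendants of the intervened variable, so their squared z-scores tend to be large as well, while the Cholesky score in the correct ordering concentrates on the intervened variable. With a wrong ordering that concentration is not guaranteed, which is the difficulty the permutation-based argument addresses.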