🤖 AI Summary
This paper addresses the trade-off between statistical consistency and computational scalability when inferring partial correlation networks from high-dimensional clinical multi-omics data, and proposes a scalable precision matrix estimation framework. Methodologically, it (1) reparameterizes the target precision matrix within a pseudolikelihood framework while preserving its sparsity pattern; (2) estimates the matrix by minimizing an ℓ₁-penalized empirical risk built on a new loss function, retaining estimation and selection consistency under high-dimensional assumptions; and (3) combines a novel operator-splitting algorithm with communication-avoiding distributed matrix multiplication for high-performance parallel computation. Empirically, the implementation was tested on simulated data with up to one million variables whose dependency structures resemble biological networks, and the co-expression network estimated from hepatocellular carcinoma dual-omics data showed superior specificity in prioritizing key transcription factors and their co-activators.
📝 Abstract
Graphical model estimation from modern multi-omics data requires a balance between statistical estimation performance and computational scalability. We introduce a novel pseudolikelihood-based graphical model framework that reparameterizes the target precision matrix while preserving its sparsity pattern and estimates it by minimizing an $\ell_1$-penalized empirical risk based on a new loss function. The proposed estimator maintains estimation and selection consistency in various metrics under high-dimensional assumptions. The associated optimization problem allows for a provably fast computation algorithm using a novel operator-splitting approach and communication-avoiding distributed matrix multiplication. A high-performance computing implementation of our framework was tested on simulated data with up to one million variables demonstrating complex dependency structures akin to biological networks. Leveraging this scalability, we estimated a partial correlation network from a dual-omic liver cancer data set. The co-expression network estimated from the ultrahigh-dimensional data showed superior specificity in prioritizing key transcription factors and co-activators by excluding the impact of epigenomic regulation, demonstrating the value of computational scalability in multi-omic data analysis.
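
As a rough illustration of two standard ingredients the abstract refers to, the sketch below shows how a precision matrix maps to partial correlations (zero entries stay zero, so the sparsity pattern carries over) and the elementwise soft-thresholding step that $\ell_1$ penalties typically induce in operator-splitting solvers. It is not the paper's loss function or algorithm; the function names `partial_correlations` and `soft_threshold` and the toy matrix are assumptions made for this example.

```python
import numpy as np

def partial_correlations(omega):
    """Convert a precision matrix into a partial correlation matrix.

    Uses the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
    """
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)  # by convention, each variable has rho_ii = 1
    return rho

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Hypothetical toy precision matrix: variable i is linked only to its neighbours.
omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])

rho = partial_correlations(omega)
shrunk = soft_threshold(np.array([-2.0, 0.5, 3.0]), 1.0)
```

Here `rho[0, 2]` is zero because `omega[0, 2]` is zero, which is why a sparse precision matrix can be read directly as a partial correlation network. The soft-thresholding step sets small entries exactly to zero, which is how an $\ell_1$ penalty produces sparse estimates.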