AI Summary
In mQTL studies, bisulfite sequencing leaves many values of the response variable (DNA methylation levels) missing, which biases inference in multivariate regression. To address this, we propose a convex estimation framework that avoids plug-in imputation. Methodologically, we develop a three-stage procedure that jointly learns the regression coefficients and the conditional dependence structure among responses, combining sparse regression with precision matrix estimation to perform variable selection and network inference at the same time. Our key contribution is to circumvent imputation-induced bias by constructing unbiased surrogate estimators directly under the missingness mechanism. In simulations, our method improves prediction accuracy and sparse signal recovery over existing methods. In an analysis of the CARTaGENE cohort, it achieves better predictive accuracy and false-discovery control on a held-out validation set, replicates known genetic associations, and identifies novel mQTL signals.
Abstract
Identifying genetic regulators of DNA methylation (mQTLs) with multivariate models enhances statistical power, but is challenged by missing data from bisulfite sequencing. Standard imputation-based methods can introduce bias, limiting reliable inference. We propose missoNet, a novel convex estimation framework that jointly estimates regression coefficients and the precision matrix from data with missing responses. By using unbiased surrogate estimators, our three-stage procedure avoids imputation while simultaneously performing variable selection and learning the conditional dependence structure among responses. We establish theoretical error bounds, and our simulations demonstrate that missoNet consistently outperforms existing methods in both prediction and sparsity recovery. In a real-world mQTL analysis of the CARTaGENE cohort, missoNet achieved superior predictive accuracy and false-discovery control on a held-out validation set, identifying known and credible novel genetic associations. The method offers a robust, efficient, and theoretically grounded tool for genomic analyses, and is available as an R package.
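To make the idea of an unbiased surrogate estimator concrete, the following is a minimal Python sketch of one common construction, not the missoNet implementation. It assumes that responses are missing completely at random, independently across response columns, that missing entries are coded as NaN, and that the per-response observation probabilities are either known or estimated from the observed fraction. The function name `surrogate_cross_products` and the argument `rho` are hypothetical. Missing responses are zero-filled, and the resulting cross-products are rescaled by the observation probabilities so that, under these assumptions, they are unbiased for the complete-data quantities X'Y/n and Y'Y/n.

```python
import numpy as np


def surrogate_cross_products(X, Y, rho=None):
    """Inverse-probability-weighted surrogates for X'Y/n and Y'Y/n.

    X   : (n, p) array of fully observed predictors.
    Y   : (n, q) array of responses, with np.nan marking missing entries.
    rho : optional (q,) array of observation probabilities per response
          (hypothetical argument; estimated from the data when omitted).
    """
    n = X.shape[0]
    mask = ~np.isnan(Y)              # True where a response is observed
    Z = np.where(mask, Y, 0.0)       # zero-fill the missing responses

    if rho is None:
        rho = mask.mean(axis=0)      # observed fraction of each response

    # E[Z_ij] = rho_j * Y_ij, so dividing column j by rho_j removes the bias.
    S_xy = (X.T @ Z) / n / rho

    # E[Z_ij * Z_ik] = rho_j * rho_k * Y_ij * Y_ik for j != k,
    # and rho_j * Y_ij**2 for j == k, hence the modified diagonal.
    W = np.outer(rho, rho)
    np.fill_diagonal(W, rho)
    S_yy = (Z.T @ Z) / n / W

    return S_xy, S_yy


# Small usage example with two responses and one missing value in each.
X_demo = np.ones((4, 1))
Y_demo = np.array([[2.0, 1.0],
                   [np.nan, 1.0],
                   [2.0, np.nan],
                   [2.0, 1.0]])
S_xy_demo, S_yy_demo = surrogate_cross_products(X_demo, Y_demo)
```

A surrogate for Y'Y/n built this way is not guaranteed to be positive semidefinite, so any estimation procedure that consumes it has to account for that.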