Multivariate regression with missing response data for modelling regional DNA methylation QTLs

2025-07-08
AI Summary
In mQTL studies, bisulfite sequencing leaves many DNA methylation levels (the response variables) unobserved, and standard imputation-based handling of this missingness can bias multivariate regression inference. To address this, we propose missoNet, an imputation-free convex estimation framework. Methodologically, we develop a three-stage procedure that jointly estimates the regression coefficients and the conditional dependence structure among responses, combining sparse regression with precision matrix estimation to perform variable selection and network inference simultaneously. Our key contribution is avoiding imputation-induced bias by working directly with unbiased surrogate estimators constructed under the missingness mechanism. In simulations, the method outperforms existing approaches in prediction and sparsity recovery; in an analysis of the CARTaGENE cohort, it achieves better predictive accuracy and false-discovery control on a held-out validation set, replicates known genetic associations, and identifies credible novel mQTL signals.

Abstract
Identifying genetic regulators of DNA methylation (mQTLs) with multivariate models enhances statistical power, but is challenged by missing data from bisulfite sequencing. Standard imputation-based methods can introduce bias, limiting reliable inference. We propose missoNet, a novel convex estimation framework that jointly estimates regression coefficients and the precision matrix from data with missing responses. By using unbiased surrogate estimators, our three-stage procedure avoids imputation while simultaneously performing variable selection and learning the conditional dependence structure among responses. We establish theoretical error bounds, and our simulations demonstrate that missoNet consistently outperforms existing methods in both prediction and sparsity recovery. In a real-world mQTL analysis of the CARTaGENE cohort, missoNet achieved superior predictive accuracy and false-discovery control on a held-out validation set, identifying known and credible novel genetic associations. The method offers a robust, efficient, and theoretically grounded tool for genomic analyses, and is available as an R package.
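
To make "jointly estimates regression coefficients and the precision matrix" concrete, the following is a generic penalized-likelihood objective of this kind. It is an illustrative assumption, not the paper's exact formulation: the symbols B (coefficient matrix), Θ (precision matrix of the responses), the tuning parameters λ_B and λ_Θ, and the matrices S are all introduced here. S_xx is the sample cross-product of the fully observed predictors, while Ŝ_xy and Ŝ_yy stand for surrogate estimates of the cross-products that involve the partially observed responses.

```latex
\min_{B,\ \Theta \succ 0}\;
\operatorname{tr}\!\left[\left(\hat S_{yy} - 2 B^\top \hat S_{xy} + B^\top S_{xx} B\right)\Theta\right]
- \log\det\Theta
+ \lambda_B \lVert B \rVert_1
+ \lambda_\Theta \lVert \Theta \rVert_{1,\mathrm{off}}
```

In a sketch like this, the penalty on B performs variable selection and the penalty on the off-diagonal entries of Θ yields a sparse conditional dependence network. The data enter only through cross-product matrices, which is why replacing them with unbiased surrogates can remove the need to impute individual missing values.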
Problem

Research questions and friction points this paper is trying to address.

Handling missing response data in multivariate DNA methylation QTL modeling
Reducing bias in mQTL analysis without imputation methods
Improving prediction and sparsity recovery in genomic association studies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convex framework estimates coefficients and precision matrix
Unbiased surrogate estimators avoid imputation bias
Three-stage procedure enables variable selection and dependency learning
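
The idea of an unbiased surrogate estimator can be illustrated with a minimal sketch. It assumes responses are missing completely at random and independently across response columns, and it uses empirical observation rates in place of the true probabilities. The function name `surrogate_cross_products` and the toy data are hypothetical; this is not the missoNet implementation or its API. Missing responses are zero-filled, and each cross-product is then rescaled by the probability that the entries involved were observed.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def surrogate_cross_products(
    X: Matrix, Y: List[List[Optional[float]]]
) -> Tuple[Matrix, Matrix, List[float]]:
    """Return (S_xy, S_yy, rho) from predictors X and partially observed Y.

    Hypothetical helper. Assumes MCAR missingness (None marks a missing
    value) that is independent across response columns.
    """
    n, p, q = len(X), len(X[0]), len(Y[0])
    # Zero-fill missing responses; the inverse-probability weights below
    # undo the shrinkage toward zero that this filling causes.
    Z = [[0.0 if y is None else y for y in row] for row in Y]
    # Empirical probability that each response column is observed.
    rho = [sum(row[j] is not None for row in Y) / n for j in range(q)]
    if min(rho) == 0.0:
        raise ValueError("every response column needs at least one observed value")

    # Surrogate for X^T Y / n: one response entry per product, so weight 1/rho_j.
    S_xy = [
        [sum(X[i][k] * Z[i][j] for i in range(n)) / n / rho[j] for j in range(q)]
        for k in range(p)
    ]
    # Surrogate for Y^T Y / n: diagonal terms need one observation (rho_j),
    # off-diagonal terms need both entries observed (rho_j * rho_l).
    S_yy = [
        [
            sum(Z[i][j] * Z[i][l] for i in range(n)) / n
            / (rho[j] if j == l else rho[j] * rho[l])
            for l in range(q)
        ]
        for j in range(q)
    ]
    return S_xy, S_yy, rho


# Toy data: one predictor, two responses, one missing value per response.
X = [[1.0], [2.0], [3.0], [4.0]]
Y = [[1.0, None], [2.0, 4.0], [None, 6.0], [4.0, 8.0]]
S_xy, S_yy, rho = surrogate_cross_products(X, Y)
print(rho)
print(S_xy)
print(S_yy)
```

On the toy data, the first diagonal entry of `S_yy` is 7.0, the mean of the squares of the three observed values, whereas naive zero-filling without reweighting would give 5.25. A surrogate built this way is not guaranteed to be positive semidefinite, so a practical method needs some way of handling that.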
Shomoita Alam
PhD Statistics (McGill), Postdoctoral Research Fellow (Fred Hutch), Postdoctoral Researcher (McGill)
Causal Inference, Graphical Models, High-dimensional Data, Machine Learning
Yixiao Zeng
Carnegie Mellon University
Natural Language Processing, Large Language Model
Sasha Bernatsky
Department of Medicine, McGill University, The Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
Marie Hudson
Lady Davis Institute for Medical Research, Jewish General Hospital, Department of Medicine, McGill University, Montreal, Quebec, Canada
Inés Colmegna
Department of Medicine, McGill University, The Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
David A. Stephens
Professor, Department of Mathematics and Statistics, McGill University
Statistics
Celia M. T. Greenwood
Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
Archer Y. Yang
Department of Mathematics and Statistics, McGill University
Statistical machine learning, uncertainty quantification, computational statistics