🤖 AI Summary
Binary disease classification from multi-omics data (DNA methylation, mRNA, miRNA) is hampered by high dimensionality, heterogeneous modalities, complex interactions among omics layers, and severe class imbalance, which together limit both accuracy and interpretability. This paper proposes MOTGNN, an interpretable graph neural network (GNN) framework that addresses these issues. Methodologically, it (i) uses XGBoost for supervised construction of a sparse graph for each omics modality; (ii) combines modality-specific GNNs for hierarchical representation learning with a deep feedforward network for cross-omics fusion; and (iii) provides built-in interpretability by ranking key biomarkers and quantifying the contribution of each omics modality. On three real-world disease datasets, the method improves ROC-AUC, accuracy, and F1-score by 5–10% over state-of-the-art baselines and reaches an F1-score of 87.2% (vs. 33.4%) under severe class imbalance, indicating gains in predictive performance, robustness, and interpretability.
📝 Abstract
Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality and complex interactions among omics layers present major challenges for predictive modeling. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) to perform omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. On three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5–10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance (e.g., 87.2% vs. 33.4% F1 on imbalanced data). The model maintains computational efficiency through sparse graphs (2.1–2.8 edges per node) and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight MOTGNN's potential to improve both predictive accuracy and interpretability in multi-omics disease modeling.
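The abstract does not spell out how a tree ensemble is turned into a feature graph, so the following is only a minimal sketch of one plausible rule, not the authors' implementation. It assumes that two features are linked whenever one splits a node directly below a split on the other, and it assumes "edges per node" means the number of undirected edges divided by the number of features that appear in the graph. The toy trees, the feature names `g1` to `g4`, and the helper functions are all hypothetical; with a real fitted model the split structure would be read from the booster (for example from XGBoost's tree dump) instead of being written by hand.

```python
from collections import defaultdict

# Hypothetical toy ensemble for ONE omics modality. Each internal node names
# the feature it splits on; leaves are None. These trees are illustrative
# stand-ins for the trees of a fitted boosted model.
TREES = [
    {
        "feature": "g1",
        "left": {"feature": "g2", "left": None, "right": None},
        "right": {"feature": "g3", "left": None, "right": None},
    },
    {
        "feature": "g2",
        "left": {"feature": "g4", "left": None, "right": None},
        "right": None,
    },
]


def collect_edges(node, edges):
    """Add an undirected edge for every parent-split / child-split pair.

    Assumed rule (not confirmed by the abstract): features that split
    adjacent nodes of the same tree are connected in the feature graph.
    """
    if node is None:
        return
    for child in (node["left"], node["right"]):
        if child is None:
            continue
        if child["feature"] != node["feature"]:  # skip self-loops
            edges.add(frozenset((node["feature"], child["feature"])))
        collect_edges(child, edges)


def build_feature_graph(trees):
    """Return the set of undirected edges and an adjacency map."""
    edges = set()
    for tree in trees:
        collect_edges(tree, edges)
    adjacency = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    return edges, dict(adjacency)


def edges_per_node(edges, adjacency):
    """Sparsity measure: undirected edges divided by connected features."""
    return len(edges) / len(adjacency) if adjacency else 0.0


edges, adjacency = build_feature_graph(TREES)
print(sorted(sorted(e) for e in edges))
print(edges_per_node(edges, adjacency))
```

Because only features actually used in splits enter the graph, a construction of this kind performs feature selection and graph building in one supervised step, which is consistent with the sparse graphs the abstract reports. Repeating it once per modality would yield one graph each for DNA methylation, mRNA, and miRNA.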