AI Summary
Supervised gene regulatory network (GRN) inference methods rely on costly, imbalanced, and gene-biased ground-truth (GT) labels that poorly reflect true biological regulation. To address this, we propose the first unsupervised generative GRN inference framework. It incorporates biologically grounded text embeddings (e.g., PubMedBERT) as semantic priors and integrates heterogeneous biological knowledge; jointly models gene expression and regulatory topology via a variational autoencoder coupled with differentiable graph learning; and introduces a biology-driven evaluation system aligned with downstream tasks (e.g., biomarker discovery) to expose implicit biases in supervised approaches. On four benchmark datasets, the method achieves an average 38.5% performance gain over existing models, and incorporating weak label priors improves performance by a further 11.1%. It also mitigates both label bias and class imbalance, the key limitations of GT-dependent methods.
Abstract
Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance on this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases, such as class imbalances of GT interactions, rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets when using textual gene embeddings as priors, and performance improves by a further 11.1% when labeled data are integrated as priors.
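The gene-specific bias mentioned above can be illustrated with a small synthetic experiment. The sketch below is not taken from the paper: the ground truth is simulated, and the names `auroc` and `degree_only_auroc`, the edge rates, and the random 50/50 edge split are all assumptions made for illustration. It shows how a scorer that never looks at expression data, and only memorizes how often each transcription factor (TF) appears in positive training edges, can still rank held-out edges well above chance when the labels are imbalanced per gene.

```python
import random


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of this tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def degree_only_auroc(seed=0, n_tfs=20, n_targets=200):
    """AUROC of a scorer that uses only per-TF label frequency (no expression)."""
    rng = random.Random(seed)
    # Hypothetical ground truth: half of the TFs are "hubs" with many
    # positive edges, the rest are sparse. Labels carry no other signal.
    edges = []
    for tf in range(n_tfs):
        p_edge = 0.5 if tf < n_tfs // 2 else 0.05
        for _ in range(n_targets):
            edges.append((tf, int(rng.random() < p_edge)))

    # Random split over edges, so every TF appears in both train and test.
    rng.shuffle(edges)
    half = len(edges) // 2
    train, test = edges[:half], edges[half:]

    # "Training": count how often each TF is labeled positive.
    pos = [0] * n_tfs
    tot = [0] * n_tfs
    for tf, y in train:
        pos[tf] += y
        tot[tf] += 1
    rate = [pos[t] / tot[t] if tot[t] else 0.0 for t in range(n_tfs)]

    # "Prediction": score each test edge by its TF's training positive rate.
    scores = [rate[tf] for tf, _ in test]
    labels = [y for _, y in test]
    return auroc(scores, labels)


if __name__ == "__main__":
    print(f"degree-only AUROC: {degree_only_auroc():.3f}")
```

Because the simulated labels depend only on TF identity, any score above 0.5 here reflects memorized label imbalance rather than regulatory signal. That is the kind of shortcut a benchmark needs to control for when comparing supervised and unsupervised methods.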