Self-supervised learning on gene expression data

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottleneck of supervised learning—its reliance on large-scale labeled samples—in phenotypic prediction from gene expression data, this study proposes a novel self-supervised representation learning paradigm. We systematically adapt three state-of-the-art self-supervised learning strategies—contrastive learning, masked reconstruction, and generative modeling—to bulk RNA-Seq data for the first time, conducting pretraining and downstream fine-tuning across multiple public datasets. Experimental results demonstrate that our approach significantly outperforms fully supervised baselines in phenotypic prediction accuracy; notably, it achieves comparable performance using only 10% of labeled data. Each method exhibits distinct strengths: contrastive learning delivers superior generalizability, masked reconstruction shows robustness to sparse signals, and generative modeling better supports multi-omics integration. This work establishes a reproducible methodological framework and practical guidelines for label-efficient analysis of genomic data.
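The contrastive strategy named above can be illustrated on toy data: create two randomly augmented "views" of each expression profile (here, random gene dropout) and check that embeddings of the same sample agree more than embeddings of different samples. This is a minimal NumPy sketch under assumed settings (synthetic data, dropout augmentation, identity encoder); the paper's actual encoders and augmentations are not specified here, and a real pipeline would train the encoder with an InfoNCE-style loss.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for a bulk RNA-Seq matrix: 64 samples x 100 genes
X = rng.normal(size=(64, 100))

def augment(X):
    # Random "gene dropout" augmentation: zero out ~20% of genes per sample
    keep = rng.random(X.shape) >= 0.2
    return X * keep

def normalize(Z):
    # L2-normalize rows so dot products become cosine similarities
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Two independently augmented views of the same samples
v1 = normalize(augment(X))
v2 = normalize(augment(X))

sim = v1 @ v2.T                                  # cosine similarity matrix
pos = np.mean(np.diag(sim))                      # positive pairs: same sample, different views
neg = (sim.sum() - np.trace(sim)) / (64 * 63)    # negative pairs: different samples

# Positive pairs should be far more similar than negatives,
# which is the signal a contrastive loss amplifies during pretraining.
```

Because the two views of one sample share most of their unmasked genes, `pos` lands well above `neg`; contrastive pretraining trains an encoder to maximize exactly this gap.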

📝 Abstract
Predicting phenotypes from gene expression data is a crucial task in biomedical research, enabling insights into disease mechanisms, drug responses, and personalized medicine. Traditional machine learning and deep learning approaches rely on supervised learning, which requires large quantities of labeled data that are costly and time-consuming to obtain for gene expression data. Self-supervised learning has recently emerged as a promising approach to overcome these limitations by extracting information directly from the structure of unlabeled data. In this study, we investigate the application of state-of-the-art self-supervised learning methods to bulk gene expression data for phenotype prediction. We selected three self-supervised methods, based on different approaches, to assess their ability to exploit the inherent structure of the data and to generate high-quality representations that can be used for downstream predictive tasks. Using several publicly available gene expression datasets, we demonstrate how the selected methods can effectively capture complex information and improve phenotype prediction accuracy. The results show that self-supervised learning methods can outperform traditional supervised models while offering the significant advantage of reduced dependency on annotated data. We provide a comprehensive analysis of the performance of each method, highlighting its strengths and limitations. We also provide recommendations for choosing among these methods depending on the case under study. Finally, we outline future research directions to enhance the application of self-supervised learning in the field of gene expression data analysis. This is the first study to apply self-supervised learning to bulk RNA-Seq data.
Problem

Research questions and friction points this paper is trying to address.

Predicting phenotypes from gene expression data efficiently
Reducing dependency on labeled data with self-supervised learning
Improving accuracy in biomedical research using bulk RNA-Seq data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning for gene expression data
Exploiting unlabeled data structure for phenotype prediction
Reducing dependency on annotated data
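The masked-reconstruction strategy listed above can be sketched as: hide a random fraction of gene expression values and train a model to reconstruct them from the rest, so the model must learn gene-gene dependencies without any phenotype labels. The sketch below uses synthetic data and a tiny linear encoder/decoder trained by plain gradient descent; all sizes, the masking rate, and the architecture are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a bulk RNA-Seq matrix: 200 samples x 50 genes
X = rng.normal(size=(200, 50))

mask_frac = 0.15
mask = rng.random(X.shape) < mask_frac   # True where values are hidden
X_in = np.where(mask, 0.0, X)            # masked input: hidden genes zeroed

# Tiny linear encoder/decoder (illustrative; real models are deeper)
d, k = X.shape[1], 16
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))
lr = 0.1

def masked_loss():
    # Score reconstruction only on the hidden entries
    recon = X_in @ W_enc @ W_dec
    return np.mean((recon - X)[mask] ** 2)

loss_before = masked_loss()
for _ in range(200):
    H = X_in @ W_enc
    recon = H @ W_dec
    # Gradient flows only from masked cells, normalized by their count
    err = np.where(mask, recon - X, 0.0) / mask.sum()
    W_dec -= lr * (H.T @ err)
    W_enc -= lr * (X_in.T @ (err @ W_dec.T))
loss_after = masked_loss()
```

After pretraining, the encoder output `X_in @ W_enc` would serve as the learned representation for a downstream phenotype classifier, which is what allows the label-efficient fine-tuning described above.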
Kevin Dradjat
IBISC Laboratory, University Paris-Saclay (Univ. Evry), France and ADLIN, Paris, France
Massinissa Hamidi
IBISC Laboratory, University Paris-Saclay (Univ. Evry), France
Pierre Bartet
ADLIN, Paris, France
Blaise Hanczar
Professor, Université Paris-Saclay (Univ. Evry)
Machine Learning · Bioinformatics