Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Gene expression data are inherently high-dimensional and sparse, with pervasive missing values that compromise modeling robustness. Method: GexBERT, a Transformer-based autoencoder framework tailored to transcriptomic data, combines masked-reconstruction pretraining to learn context-aware gene embeddings, gene co-expression priors to enhance biological plausibility, and attention mechanisms that enable interpretable cross-cancer pattern discovery. Results: On pan-cancer classification, GexBERT achieves state-of-the-art accuracy from only a small subset of key genes. For survival prediction, it delivers significant performance gains, largely attributable to accurate restoration of anchor-gene expression. Under high missingness rates, its imputation accuracy substantially outperforms conventional methods (e.g., KNN, MICE, and deep autoencoders), demonstrating superior robustness and biological fidelity.

📝 Abstract
Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
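The masking-and-restoration objective described in the abstract can be sketched in a deliberately simplified form. This is not the paper's implementation: it uses plain Python lists instead of tensors, and a context-mean predictor stands in for the Transformer; only the masking pattern and the masked-position loss reflect the described objective.

```python
import random

def masked_restoration_loss(profile, mask_frac=0.15, predict=None):
    """Toy BERT-style masking objective on one expression profile
    (a list of floats). A random subset of genes is hidden, a model
    predicts them from the visible context, and the loss is the mean
    squared error on the hidden positions only."""
    n = len(profile)
    hidden = set(random.sample(range(n), max(1, int(n * mask_frac))))
    # Replace hidden genes with a sentinel value the model sees instead.
    visible = [0.0 if i in hidden else v for i, v in enumerate(profile)]
    if predict is None:
        # Stand-in for the Transformer: predict every hidden gene as the
        # mean of the visible (unmasked) genes.
        ctx = [v for i, v in enumerate(profile) if i not in hidden]
        mean = sum(ctx) / len(ctx)
        predict = lambda vis: [mean] * n
    preds = predict(visible)
    # Loss is computed only over the masked positions.
    return sum((preds[i] - profile[i]) ** 2 for i in hidden) / len(hidden)
```

For a perfectly flat profile the context-mean predictor restores the hidden genes exactly, so the loss is zero; in practice the Transformer replaces this predictor and is trained to drive the masked-position error down across thousands of genes.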
Problem

Research questions and friction points this paper is trying to address.

Addressing data sparsity and high dimensionality in gene expression analysis
Improving cancer classification and survival prediction with limited gene data
Enhancing missing value imputation in transcriptomic profiles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based autoencoder for gene expression
Masking and restoration objective for gene embeddings
Attention-based interpretability reveals gene patterns
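The attention-based interpretability mentioned above rests on reading off attention weights between gene embeddings. A minimal sketch of scaled dot-product attention weights, with illustrative hand-picked embeddings rather than learned ones:

```python
import math

def attention_weights(query, keys):
    """How strongly one gene's embedding attends to each gene embedding
    in `keys`: scaled dot-product scores passed through a softmax.
    Embeddings here are plain lists of floats for illustration; a real
    model learns them during pretraining."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Genes receiving the largest weights for a query gene are its strongest attention partners, which is the kind of pattern inspected when interpreting the model across cancer types.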