Contrastive Learning Enhances Language Model Based Cell Embeddings for Low-Sample Single Cell Transcriptomics

📅 2025-09-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Identifying rare cell subtypes in low-sample single-cell RNA sequencing (scRNA-seq) data remains challenging, and the difficulty is compounded by the lack of biologically informed gene representations. Method: We propose a gene embedding framework that integrates large language models (LLMs) with contrastive learning. Specifically, we encode NCBI Gene descriptive texts using BioBERT, SciBERT, and text-embedding-ada-002 to generate knowledge-enhanced gene embeddings, which are combined with scRNA-seq expression matrices to construct biologically grounded cell representations. Contribution/Results: The approach is designed to mitigate data scarcity and improve discrimination of rare subtypes. On a retinal ganglion cell dataset, it improves subtype classification, highlights biologically significant features, and reveals pathways underlying selective vulnerability to glaucoma-related neurodegeneration. To our knowledge, this is the first work to inject LLM-derived textual knowledge into scRNA-seq gene embeddings, and it lays groundwork for interpretable, knowledge-guided foundation models in single-cell analysis.

📝 Abstract

Large language models (LLMs) have shown strong ability in generating rich representations across domains such as natural language processing and generation, computer vision, and multimodal learning. However, their application in biomedical data analysis remains nascent. Single-cell transcriptomic profiling is essential for dissecting cell subtype diversity in development and disease, but rare subtypes pose challenges for scaling laws. We present a computational framework that integrates single-cell RNA sequencing (scRNA-seq) with LLMs to derive knowledge-informed gene embeddings. Highly expressed genes for each cell are mapped to NCBI Gene descriptions and embedded using models such as text-embedding-ada-002, BioBERT, and SciBERT. Applied to retinal ganglion cells (RGCs), which differ in vulnerability to glaucoma-related neurodegeneration, this strategy improves subtype classification, highlights biologically significant features, and reveals pathways underlying selective neuronal vulnerability. More broadly, it illustrates how LLM-derived embeddings can augment biological analysis under data-limited conditions and lay the groundwork for future foundation models in single-cell biology.
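The step from highly expressed genes to a cell representation can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes the gene embeddings have already been computed from NCBI Gene descriptions by one of the text models, and it assumes an expression-weighted average over each cell's top-k genes, a pooling rule the abstract does not specify. The function name `cell_embeddings` and the parameter `top_k` are hypothetical.

```python
import numpy as np


def cell_embeddings(expr, gene_emb, top_k=3):
    """Pool text-derived gene embeddings into one vector per cell.

    expr:     (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) precomputed gene embeddings (assumed given)
    top_k:    number of most highly expressed genes used per cell (assumption)
    """
    expr = np.asarray(expr, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)

    # Column indices of the top_k most expressed genes in each cell.
    top = np.argsort(-expr, axis=1)[:, :top_k]

    # Expression values of those genes, used as pooling weights.
    weights = np.take_along_axis(expr, top, axis=1)

    # Normalize weights per cell; cells with no expression get all-zero weights.
    totals = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, totals, out=np.zeros_like(weights), where=totals > 0)

    # Weighted average of the selected gene embeddings: (n_cells, dim).
    return np.einsum("ck,ckd->cd", weights, gene_emb[top])


# Toy usage: 2 cells, 3 genes, 3-dimensional placeholder gene embeddings.
toy_expr = np.array([[0.0, 2.0, 2.0], [5.0, 0.0, 0.0]])
toy_gene_emb = np.eye(3)
toy_cells = cell_embeddings(toy_expr, toy_gene_emb, top_k=2)
```

Weighting by expression keeps the cell vector in the same space as the gene embeddings, so nearest-neighbor genes of a cell vector remain interpretable. Other pooling rules (an unweighted mean, or embedding a concatenated description per cell) would fit the abstract equally well.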

Problem

Research questions and friction points this paper is trying to address.

Enhancing cell embeddings for rare subtypes using contrastive learning

Integrating single-cell RNA sequencing with language models

Improving subtype classification under data-limited biological conditions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates single-cell RNA sequencing with large language models

Maps highly expressed genes to NCBI descriptions for embedding

Uses contrastive learning to enhance cell subtype classification
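The contrastive step listed above can be illustrated with a small NumPy sketch. The page does not say which contrastive objective the paper uses, so this assumes a supervised contrastive (SupCon-style) loss in which cells sharing a subtype label are treated as positives. The function name, the `temperature` default, and the use of subtype labels are all assumptions made for illustration.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """SupCon-style loss over cell embeddings (illustrative assumption).

    z:      (n_cells, dim) cell embeddings with non-zero rows
    labels: (n_cells,) subtype labels; same label = positive pair
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)

    # Cosine similarity between all pairs of cells, scaled by temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Exclude each cell's similarity with itself from the softmax.
    not_self = ~np.eye(n, dtype=bool)
    sim = np.where(not_self, sim, -np.inf)

    # Numerically stable log-softmax over the other cells in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))

    # Positives: other cells with the same subtype label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no positive pairs in this batch

    # Average log-probability of positives per anchor, then over anchors.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / n_pos[has_pos]).mean())
```

Minimizing this loss pulls same-subtype cells together and pushes other cells apart in embedding space. Anchors with no same-label partner in the batch are skipped, which matters for rare subtypes where a batch may contain a single example.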
