Contrastive Learning Enhances Language Model Based Cell Embeddings for Low-Sample Single Cell Transcriptomics

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Identifying rare cell subtypes in low-sample single-cell RNA sequencing (scRNA-seq) data remains challenging, compounded by the lack of biologically informed gene representations. Method: We propose a gene embedding framework integrating large language models (LLMs) with contrastive learning. Specifically, we encode NCBI gene descriptive texts using BioBERT, SciBERT, and text-embedding-ada-002 to generate knowledge-enhanced gene embeddings, which are jointly leveraged with scRNA-seq expression matrices to construct biologically grounded cell representations. Contribution/Results: Our approach mitigates data scarcity and significantly improves rare subtype discrimination. On a retinal ganglion cell dataset, it achieves markedly higher subtype classification accuracy and successfully identifies glaucoma-associated neurodegenerative pathways and key regulatory genes. To our knowledge, this is the first work to inject LLM-derived textual knowledge into scRNA-seq gene embedding, establishing a novel, interpretable, and knowledge-guided foundation model paradigm for single-cell analysis.

Technology Category

Application Category

📝 Abstract
Large language models (LLMs) have shown strong ability in generating rich representations across domains such as natural language processing and generation, computer vision, and multimodal learning. However, their application in biomedical data analysis remains nascent. Single-cell transcriptomic profiling is essential for dissecting cell subtype diversity in development and disease, but rare subtypes pose challenges for scaling laws. We present a computational framework that integrates single-cell RNA sequencing (scRNA-seq) with LLMs to derive knowledge-informed gene embeddings. Highly expressed genes for each cell are mapped to NCBI Gene descriptions and embedded using models such as text-embedding-ada-002, BioBERT, and SciBERT. Applied to retinal ganglion cells (RGCs), which differ in vulnerability to glaucoma-related neurodegeneration, this strategy improves subtype classification, highlights biologically significant features, and reveals pathways underlying selective neuronal vulnerability. More broadly, it illustrates how LLM-derived embeddings can augment biological analysis under data-limited conditions and lay the groundwork for future foundation models in single-cell biology.
Problem

Research questions and friction points this paper is trying to address.

Enhancing cell embeddings for rare subtypes using contrastive learning
Integrating single-cell RNA sequencing with language models
Improving subtype classification under data-limited biological conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates single-cell RNA sequencing with large language models
Maps highly expressed genes to NCBI descriptions for embedding
Uses contrastive learning to enhance cell subtype classification
🔎 Similar Papers
No similar papers found.
L
Luxuan Zhang
Department of Neurology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, 02115, MA, USA
D
Douglas Jiang
Department of Neurology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, 02115, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA
Qinglong Wang
Qinglong Wang
Zhejiang University
AI securityAI for System
Haoqi Sun
Haoqi Sun
Department of Neurology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, 02115, MA, USA
F
Feng Tian
Department of Neurology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, 02115, MA, USA