🤖 AI Summary
Identifying rare cell subtypes in single-cell RNA sequencing (scRNA-seq) data with few samples remains challenging, and the difficulty is compounded by a lack of biologically informed gene representations. Method: We propose a gene embedding framework that integrates large language models (LLMs) with contrastive learning. Specifically, we encode NCBI Gene descriptive texts with BioBERT, SciBERT, and text-embedding-ada-002 to generate knowledge-enhanced gene embeddings, which are combined with scRNA-seq expression matrices to construct biologically grounded cell representations. Contribution/Results: Our approach mitigates data scarcity and improves discrimination of rare subtypes. On a retinal ganglion cell dataset, it achieves higher subtype classification accuracy and identifies glaucoma-associated neurodegenerative pathways and key regulatory genes. To our knowledge, this is the first work to inject LLM-derived textual knowledge into scRNA-seq gene embeddings, pointing toward interpretable, knowledge-guided foundation models for single-cell analysis.
📝 Abstract
Large language models (LLMs) have shown a strong ability to generate rich representations across domains such as natural language processing and generation, computer vision, and multimodal learning. However, their application to biomedical data analysis remains nascent. Single-cell transcriptomic profiling is essential for dissecting cell subtype diversity in development and disease, but rare subtypes are represented by few cells, which makes them difficult to model with approaches that depend on large-scale data. We present a computational framework that integrates single-cell RNA sequencing (scRNA-seq) with LLMs to derive knowledge-informed gene embeddings. The highly expressed genes of each cell are mapped to their NCBI Gene descriptions, and these descriptions are embedded using models such as text-embedding-ada-002, BioBERT, and SciBERT. Applied to retinal ganglion cells (RGCs), which differ in their vulnerability to glaucoma-related neurodegeneration, this strategy improves subtype classification, highlights biologically significant features, and reveals pathways underlying selective neuronal vulnerability. More broadly, it illustrates how LLM-derived embeddings can augment biological analysis under data-limited conditions and lay the groundwork for future foundation models in single-cell biology.
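As a rough illustration of how text-derived gene embeddings might be combined with an expression matrix, the sketch below builds a cell representation as an expression-weighted average over the embeddings of each cell's most highly expressed genes. This is an assumption made for illustration only: the abstract does not specify the aggregation rule, and the function name `cell_embeddings`, the `top_k` parameter, and the toy arrays are all hypothetical. In practice the rows of `gene_emb` would come from encoding NCBI Gene descriptions with a model such as BioBERT, SciBERT, or text-embedding-ada-002.

```python
import numpy as np


def cell_embeddings(expr, gene_emb, top_k=2):
    """Expression-weighted average of gene embeddings per cell.

    Hypothetical aggregation scheme, not the authors' published method.
    expr:     (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) text-derived gene embeddings
    top_k:    number of most highly expressed genes kept per cell
    """
    expr = np.asarray(expr, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)
    n_cells, n_genes = expr.shape
    if gene_emb.shape[0] != n_genes:
        raise ValueError("gene_emb must have one row per gene in expr")

    k = min(top_k, n_genes)
    # Column indices of the k most highly expressed genes in each cell.
    top_idx = np.argsort(-expr, axis=1)[:, :k]
    top_expr = np.take_along_axis(expr, top_idx, axis=1)  # (n_cells, k)

    # Normalise expression into weights; cells with zero total
    # expression fall back to uniform weights to avoid dividing by zero.
    totals = top_expr.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals > 0, totals, 1.0)
    weights = np.where(totals > 0, top_expr / safe_totals, 1.0 / k)

    # gene_emb[top_idx] has shape (n_cells, k, dim); weight and sum over k.
    return np.einsum("ck,ckd->cd", weights, gene_emb[top_idx])


# Toy inputs standing in for real data (values are made up).
toy_expr = np.array([[3.0, 1.0, 0.0],
                     [0.0, 2.0, 2.0]])
toy_gene_emb = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]])
toy_cells = cell_embeddings(toy_expr, toy_gene_emb, top_k=2)
```

The resulting matrix has one row per cell and could be passed to any downstream classifier. A weighted average is only one simple choice; concatenation with the raw expression profile or a learned projection would be equally plausible readings of the abstract.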