🤖 AI Summary
Existing single-cell pre-trained language models (PLMs) are disconnected from text PLMs, hindering cross-modal tasks; mainstream fusion approaches further incur information loss and inadequate unimodal representation learning. To address this, the authors propose scMMGPT, a unified multimodal generative pre-trained Transformer for single-cell data that aligns the cell and text modalities through dedicated cross-modal projectors. Pre-trained on 27 million single cells, the largest corpus for multimodal cell-text PLMs to date, scMMGPT shows substantial improvements: an 84% relative improvement in textual discrepancy for cell description generation, a 20.5% increase in cell-type annotation accuracy, and a 4% gain in k-NN accuracy for text-conditioned pseudo-cell generation. It thus bridges the cross-modal semantic gap and enables robust bidirectional knowledge transfer between the two modalities.
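The core mechanism described above is a cross-modal projector that maps cell-PLM embeddings into the text-PLM embedding space so the two modalities can be compared directly. The following is a minimal sketch of that idea, not the paper's actual architecture: the dimensions, the single linear layer, and the cosine-similarity alignment score are all illustrative assumptions.

```python
import math
import random

random.seed(0)

def project(x, W, b):
    """Linear cross-modal projector (illustrative): map a cell embedding x
    (length d_cell) into the text-PLM embedding space (length d_text)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def cosine(u, v):
    """Cosine similarity between two embeddings, a common score for
    measuring cell-text alignment after projection."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical dimensions: the cell PLM emits 8-d embeddings,
# the text PLM operates in a 4-d space.
d_cell, d_text = 8, 4
W = [[random.gauss(0, 0.1) for _ in range(d_cell)] for _ in range(d_text)]
b = [0.0] * d_text

cell_emb = [random.gauss(0, 1) for _ in range(d_cell)]
text_emb = [random.gauss(0, 1) for _ in range(d_text)]

projected = project(cell_emb, W, b)       # now comparable with text_emb
score = cosine(projected, text_emb)        # alignment score in [-1, 1]
```

In the real model the projector is trained so that matching cell-text pairs score higher than mismatched ones; the sketch only shows the shape of the computation.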
📝 Abstract
Pre-trained language models (PLMs) have revolutionized scientific research, yet their application to single-cell analysis remains limited. Text PLMs cannot process single-cell RNA sequencing data, while cell PLMs lack the ability to handle free text, restricting their use in multimodal tasks. Existing efforts to bridge these modalities often suffer from information loss or inadequate single-modal pre-training, leading to suboptimal performance. To address these challenges, we propose Single-Cell MultiModal Generative Pre-trained Transformer (scMMGPT), a unified PLM for joint cell and text modeling. scMMGPT effectively integrates state-of-the-art cell and text PLMs, facilitating cross-modal knowledge sharing for improved performance. To bridge the text-cell modality gap, scMMGPT leverages dedicated cross-modal projectors and undergoes extensive pre-training on 27 million cells -- the largest dataset for multimodal cell-text PLMs to date. This large-scale pre-training enables scMMGPT to excel in joint cell-text tasks, achieving an 84% relative improvement in textual discrepancy for cell description generation, 20.5% higher accuracy for cell type annotation, and a 4% improvement in $k$-NN accuracy for text-conditioned pseudo-cell generation, outperforming baselines.
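The $k$-NN accuracy metric cited for text-conditioned pseudo-cell generation can be understood as follows: each generated pseudo-cell is classified by a majority vote among its $k$ nearest real cells, and accuracy is the fraction whose predicted label matches the cell type requested in the text prompt. The sketch below is an illustrative reconstruction of that evaluation, not the paper's exact protocol; the function name, `k=3`, and Euclidean distance are assumptions.

```python
import math
from collections import Counter

def knn_accuracy(generated, intended_labels, reference, reference_labels, k=3):
    """Hypothetical k-NN evaluation for text-conditioned generation:
    classify each generated pseudo-cell by majority vote among its k
    nearest real reference cells (Euclidean distance), then report the
    fraction whose vote matches the cell type the prompt asked for."""
    correct = 0
    for cell, intended in zip(generated, intended_labels):
        # Distance from the generated cell to every real reference cell.
        dists = sorted(
            (math.dist(cell, ref), lab)
            for ref, lab in zip(reference, reference_labels)
        )
        votes = Counter(lab for _, lab in dists[:k])
        if votes.most_common(1)[0][0] == intended:
            correct += 1
    return correct / len(generated)

# Toy example: two real cell types ("A", "B") in a 2-d expression space.
reference = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
reference_labels = ["A", "A", "B", "B"]

# Two generated pseudo-cells, each prompted to be a specific type.
generated = [(0.0, 0.5), (5.0, 5.5)]
intended_labels = ["A", "B"]

acc = knn_accuracy(generated, intended_labels, reference, reference_labels)
```

A 4% improvement on this metric means the generated cells land measurably closer to real cells of the requested type than the baselines' outputs do.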