Parameter-free representations outperform single-cell foundation models on downstream benchmarks

📅 2026-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the prevailing reliance on complex deep learning models in single-cell analysis by investigating whether parameter-free linear methods can match, or even surpass, the downstream performance of state-of-the-art single-cell foundation models. The authors use simple pipelines built on careful normalization and interpretable gene expression representations, which produce effective features without training a deep model. Empirical evaluations show that the approach achieves state-of-the-art or near-state-of-the-art performance across multiple benchmarks, and that it outperforms Transformer-based single-cell foundation models on out-of-distribution tasks involving cell types and species absent from the training data. The results suggest that simple linear representations have substantial potential in single-cell genomics.

📝 Abstract

Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.
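To make the idea of "careful normalization and linear methods" concrete, the following is a minimal illustrative sketch, not the authors' pipeline, whose exact steps are not given on this page. It assumes a common scRNA-seq recipe (library-size scaling to a fixed total followed by log1p) as the training-free representation, and uses a cosine nearest-centroid classifier as a stand-in for a simple downstream evaluation. The function names, the `target_sum` value, and the simulated Poisson counts are all hypothetical choices made for illustration.

```python
import numpy as np

def normalize_counts(counts, target_sum=1e4):
    """Training-free representation (assumed recipe): scale each cell to a
    fixed total count, then apply log1p. No parameters are learned."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / lib * target_sum)

def fit_centroids(X, labels):
    """Mean normalized expression profile per cell type."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each cell to the centroid with the highest cosine similarity."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Cn = centroids / np.maximum(
        np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12
    )
    return classes[np.argmax(Xn @ Cn.T, axis=1)]

# Toy data (hypothetical): two cell types with opposite marker-gene blocks
# and a random per-cell sequencing depth.
rng = np.random.default_rng(0)
n_genes = 50

def simulate(n_cells, high_first):
    first_block = np.arange(n_genes) < n_genes // 2
    rates = np.where(first_block == high_first, 20.0, 1.0)
    depth = rng.uniform(0.5, 2.0, size=(n_cells, 1))
    return rng.poisson(rates * depth)

train_X = np.vstack([simulate(100, True), simulate(100, False)])
train_y = np.array(["A"] * 100 + ["B"] * 100)
test_X = np.vstack([simulate(50, True), simulate(50, False)])
test_y = np.array(["A"] * 50 + ["B"] * 50)

classes, centroids = fit_centroids(normalize_counts(train_X), train_y)
pred = predict(normalize_counts(test_X), classes, centroids)
accuracy = float(np.mean(pred == test_y))
print(f"toy accuracy: {accuracy:.2f}")
```

The library-size step removes the per-cell depth factor before any comparison is made, which is why a purely linear similarity can separate the two simulated types here. Real benchmarks are far harder than this toy setting, so the sketch only shows the shape of such a pipeline, not its expected performance.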

Problem

Research questions and friction points this paper is trying to address.

single-cell RNA sequencing

foundation models

parameter-free representations

downstream benchmarks

out-of-distribution generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-free representation

single-cell RNA-seq

linear methods