🤖 AI Summary
Single-cell sequencing yields high-dimensional, irregular point clouds of cells, which makes it difficult to quantify biological variation between individuals directly; moreover, existing nonlinear models (e.g., kernel methods, deep networks) lack interpretability. To address this, we propose a unified analytical framework based on Linear Optimal Transport (LOT): patient-level point clouds are embedded into a fixed-dimensional Euclidean space via LOT barycentric averaging for distribution alignment, enabling both linear reconstruction and inverse mapping. The method combines high predictive accuracy with biological interpretability (classifier weights are directly attributable to key marker genes) and generative capability (it synthesizes biologically plausible, patient-specific organoid-like data). Applied to multi-omics COVID-19 datasets, the model delivers accurate and interpretable disease-state classification and facilitates mechanistic investigation of drug–disease interactions.
📝 Abstract
Single-cell technologies generate high-dimensional point clouds of cells, enabling detailed characterization of complex patient states and treatment responses. Yet each patient is represented by an irregular point cloud rather than a simple vector, making it difficult to directly quantify and compare biological differences between individuals. Nonlinear methods such as kernels and neural networks achieve predictive accuracy but act as black boxes, offering little biological interpretability.
To address these limitations, we adapt the Linear Optimal Transport (LOT) framework to this setting, embedding irregular point clouds into a fixed-dimensional Euclidean space while preserving distributional structure. This embedding provides a principled linear representation that preserves optimal transport geometry while enabling downstream analysis. It also forms a registration between any two patients, enabling direct comparison of their cellular distributions. Within this space, LOT enables: (i) accurate and interpretable classification of COVID-19 patient states, where classifier weights map back to specific markers and spatial regions driving predictions; and (ii) synthetic data generation for patient-derived organoids, exploiting the linearity of the LOT embedding. LOT barycenters yield averaged cellular profiles representing combined conditions or samples, supporting drug interaction testing.
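The embedding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes uniform cell weights, a squared Euclidean cost, a fixed reference cloud shared by all patients, and an exact optimal transport plan obtained from a small linear program (`scipy.optimize.linprog`). The names `ot_plan`, `lot_embed`, `reference`, `patient_a`, and `patient_b` are hypothetical, and the data are random placeholders.

```python
import numpy as np
from scipy.optimize import linprog


def ot_plan(a, b, cost):
    """Exact discrete OT plan between weights a (n,) and b (m,) via an LP."""
    n, m = cost.shape
    # Row-sum constraints: sum_j P[i, j] = a[i] (P is flattened row-major).
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: sum_i P[i, j] = b[j].
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    return res.x.reshape(n, m)


def lot_embed(reference, cloud):
    """Map each reference point to the barycenter of the cells it is sent to.

    The result has the shape of `reference` no matter how many cells
    `cloud` contains, so every patient gets a fixed-size representation.
    """
    n, m = reference.shape[0], cloud.shape[0]
    a = np.full(n, 1.0 / n)  # uniform weights (an assumption)
    b = np.full(m, 1.0 / m)
    cost = ((reference[:, None, :] - cloud[None, :, :]) ** 2).sum(axis=-1)
    plan = ot_plan(a, b, cost)
    return (plan @ cloud) / a[:, None]  # barycentric projection


rng = np.random.default_rng(0)
reference = rng.normal(size=(8, 2))        # shared reference cloud
patient_a = rng.normal(size=(12, 2)) + 1.0  # 12 cells, 2 markers
patient_b = rng.normal(size=(5, 2)) - 1.0   # 5 cells, 2 markers

emb_a = lot_embed(reference, patient_a)
emb_b = lot_embed(reference, patient_b)

# One fixed-length feature vector per patient, usable by a linear classifier.
features = np.stack([emb_a.ravel(), emb_b.ravel()])

# Euclidean distance between embeddings, weighted by the reference masses.
lot_distance = np.linalg.norm(emb_a - emb_b) / np.sqrt(reference.shape[0])

# Linearity: averaging embeddings gives a synthetic in-between point cloud.
midpoint_cloud = 0.5 * (emb_a + emb_b)
```

Because row `i` of every embedding refers to the same reference point, a linear classifier's weight vector of length `n * d` can be reshaped to `(n, d)` and read as per-region, per-marker contributions. The same row correspondence acts as a registration between any two patients.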
Together, these results establish LOT as a unified framework that bridges predictive performance, interpretability, and generative modeling. By transforming heterogeneous point clouds into structured embeddings directly traceable to the original data, LOT opens new opportunities for understanding immune variation and treatment effects in high-dimensional biological systems.