Running in circles: practical limitations for real-life application of data fission and data thinning in post-clustering differential analysis

📅 2024-05-22

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

In post-clustering differential expression analysis of scRNA-seq data, data fission breaks down when the cluster structure is unknown: splitting the data into independent parts requires cluster-specific scale parameters, and estimating those requires the very cluster labels the analysis is trying to infer. Biased scale estimates break the independence between the two parts and inflate the Type I error. This work identifies two limitations: the non-mixture parametric assumption behind data fission does not hold for clustered data, and applying fission to a mixture requires prior knowledge of the clusters. It theoretically quantifies how bias in the scale estimates propagates to Type I error inflation. Because cluster structure is unknown in practice, it proposes a heteroscedastic model with nonparametric, proximity-based estimators of individual scale parameters. This approach works well when clusters are well separated but introduces bias when separation is weak, which marks the practical limit of data fission for post-clustering inference.


📝 Abstract

Post-clustering inference in scRNA-seq analysis presents significant challenges in controlling Type I error during differential expression analysis. Data fission, a promising approach, aims to split the data into two new independent parts, but relies on strong parametric assumptions of non-mixture distributions, which are violated in clustered data. We show that applying data fission to these mixtures requires knowledge of the clustering structure to accurately estimate component-specific scale parameters. These estimates are critical for ensuring decomposition and independence. We theoretically quantify the direct impact of the bias in estimating these scale parameters on the inflation of the Type I error rate, caused by a deviation from independence. Since component structures are unknown in practice, we propose a heteroscedastic model with non-parametric estimators for individual scale parameters. This model uses proximity between observations to capture the effect of the underlying mixture on data dispersion. While this approach works well when clusters are well-separated, it introduces bias when separation is weak, highlighting the difficulty of applying data fission in real-world scenarios with unknown degrees of separation.
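
The dependence on the scale parameter described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's code: it assumes a univariate two-component Gaussian mixture and the standard Gaussian fission decomposition X1 = X + tau*Z, X2 = X - Z/tau with Z ~ N(0, sigma_hat^2). The function name `gaussian_fission`, the helper `within_cov`, and all numeric settings are hypothetical choices made for illustration. Within a component with true variance sigma^2, the two parts have covariance sigma^2 - sigma_hat^2, so they are uncorrelated only when sigma_hat matches the component-specific scale.

```python
import numpy as np

def gaussian_fission(x, sigma_hat, tau=1.0, rng=None):
    """Split x into two parts using Gaussian noise with scale sigma_hat.

    Assumes x is Gaussian with standard deviation sigma within a component;
    the parts are independent only if sigma_hat equals that sigma.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(0.0, sigma_hat, size=x.shape)
    return x + tau * z, x - z / tau

def within_cov(a, b, labels):
    """Average covariance between a and b computed inside each component."""
    return float(np.mean([np.cov(a[labels == k], b[labels == k])[0, 1]
                          for k in np.unique(labels)]))

rng = np.random.default_rng(0)
n = 100_000
sigma = 1.0                           # true within-component scale (assumed)
means = np.array([-3.0, 3.0])         # hypothetical, well-separated components
labels = rng.integers(0, 2, size=n)   # component memberships, unknown in practice
x = means[labels] + rng.normal(0.0, sigma, size=n)

# Oracle: uses the true component-specific scale, which requires knowing the clusters.
x1, x2 = gaussian_fission(x, sigma, rng=rng)
cov_oracle = within_cov(x1, x2, labels)

# Naive: ignores the mixture and plugs in the pooled standard deviation.
sigma_pooled = x.std()
y1, y2 = gaussian_fission(x, sigma_pooled, rng=rng)
cov_naive = within_cov(y1, y2, labels)

print(f"pooled sd: {sigma_pooled:.2f} (true within-component sd: {sigma})")
print(f"within-component covariance, oracle scale: {cov_oracle:.3f}")
print(f"within-component covariance, pooled scale: {cov_naive:.3f}")
print(f"theory for pooled scale (sigma^2 - sigma_hat^2): {sigma**2 - sigma_pooled**2:.3f}")
```

In this sketch the pooled estimate absorbs the between-component spread, so it overestimates the within-component scale and the two parts become strongly negatively correlated. Clustering on one part and testing on the other would then no longer be independent steps, which is the mechanism the abstract links to Type I error inflation.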

Problem

Research questions and friction points this paper is trying to address.

Addresses Type I error control in post-clustering differential expression analysis

Identifies violations of parametric assumptions in clustered scRNA-seq data

Reveals fundamental limitations of data fission in real-world applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional data fission for mixture decomposition

Shows that fission of mixtures requires prior knowledge of the clustering to estimate scale parameters

Quantifies how deviation from independence inflates the Type I error
