Testing for a common subspace in compositional datasets with structural zeros

📅 2025-10-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Compositional datasets with structural zeros are usually split into the observations that contain zeros and those that are strictly positive, and each subset is then analyzed separately as if it came from a different subpopulation. When that split is artificial, the separate analyses can give incomplete results. This paper proposes a statistical test of whether the first K principal components of the two subsets span the same vector space. The null distribution is approximated analytically when the data are normally distributed on the simplex, and by a nonparametric bootstrap otherwise. In simulations, the procedure distinguishes scenarios where the subpopulations share a common subspace from those where they are genuinely distinct, and it is also applied to an experimental microbiome dataset.

📝 Abstract

In real-world applications involving compositional datasets, structural zeros are common. They arise when, due to physical limitations, one or more variables are intrinsically zero for a subset of the population under study. The classical Aitchison approach requires all the components of a composition to be strictly positive, since the adaptation of the most widely used statistical techniques to the compositional framework relies on computing the logratios of these components. Therefore, datasets containing structural zeros are usually split into two subsets: one containing the observations with structural zeros and one containing all the other data. Statistical analysis is then performed on the two subsets separately, assuming that they are drawn from two different subpopulations. However, this approach may lead to incomplete results when the split into two populations is merely artificial. To overcome this limitation and increase the robustness of such an approach, we introduce a statistical test to check whether the first K principal components of the two datasets generate the same vector space. An approximation of the corresponding null distribution is derived analytically when the data are normally distributed on the simplex and through a nonparametric bootstrap approach in the other cases. Results from simulated data demonstrate that the proposed procedure can discriminate scenarios where the subpopulations share a common subspace from those where they are actually distinct. The performance of the proposed method is also tested on an experimental dataset of microbiome measurements.
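The idea of testing whether the first K principal components of two datasets span the same space can be illustrated with a minimal sketch. It is not the paper's procedure: the test statistic (sum of squared sines of the principal angles), the pooled resampling scheme, and the function names `top_k_basis`, `subspace_distance`, and `common_subspace_test` are all assumptions made for illustration. The sketch also assumes both subsets have already been expressed in the same real coordinates, for example log-ratio coordinates of the parts that are nonzero in both.

```python
import numpy as np

def top_k_basis(X, k):
    """Orthonormal basis (as columns) of the first k principal directions of X."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T

def subspace_distance(U, V):
    """Sum of squared sines of the principal angles between span(U) and span(V).

    The singular values of U.T @ V are the cosines of the principal angles,
    so the sum of squared sines equals k - ||U.T @ V||_F^2.
    """
    k = U.shape[1]
    return max(0.0, k - np.linalg.norm(U.T @ V, "fro") ** 2)

def common_subspace_test(X, Y, k, n_boot=500, seed=0):
    """Return (observed distance, bootstrap p-value).

    ASSUMPTION: the null is approximated by resampling both groups from the
    pooled, group-centred data. This treats the centred groups as exchangeable,
    which is stronger than merely sharing a k-dimensional subspace.
    """
    rng = np.random.default_rng(seed)
    observed = subspace_distance(top_k_basis(X, k), top_k_basis(Y, k))
    pooled = np.vstack([X - X.mean(axis=0), Y - Y.mean(axis=0)])
    n1, n2 = len(X), len(Y)
    null = np.empty(n_boot)
    for b in range(n_boot):
        # Draw two bootstrap groups of the original sizes from the pool.
        Xb = pooled[rng.integers(0, len(pooled), n1)]
        Yb = pooled[rng.integers(0, len(pooled), n2)]
        null[b] = subspace_distance(top_k_basis(Xb, k), top_k_basis(Yb, k))
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (n_boot + 1)
    return observed, p_value
```

A small p-value would suggest that the two subsets do not share the same K-dimensional principal subspace, so analyzing them as distinct subpopulations may be justified. A large one gives no evidence against a common subspace. Because the pooled resampling assumes more than a shared subspace, the sketch can reject for reasons other than differing subspaces, such as different variances along shared directions.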

Problem

Research questions and friction points this paper is trying to address.

Testing for a shared subspace in compositional datasets with structural zeros

Overcoming the limitations of splitting datasets with structural zeros

Developing a statistical test of whether the first K principal components of two subpopulations span the same space

Innovation

Methods, ideas, or system contributions that make the work stand out.

Statistical test for common subspace in compositional data

Analytical and bootstrap null distribution approximation

Checking whether a split on structural zeros reflects genuinely distinct subpopulations
