🤖 AI Summary
In high-dimensional data, structured background noise often obscures low-dimensional shared signals, rendering standard PCA ineffective. This paper proposes PCA++, a robust subspace estimation method based on contrastive learning, specifically designed for positive sample pairs, each containing identical underlying signals but distinct background noise. Its key innovation is the introduction of a hard uniformity constraint, jointly optimized with an alignment objective, yielding a closed-form solution via generalized eigenvalue decomposition. Theoretically, we establish that uniformity substantially enhances statistical robustness under high-dimensional, strong background noise and provide asymptotic consistency guarantees. Experiments on synthetic data, corrupted MNIST, and single-cell transcriptomic datasets demonstrate that PCA++ stably recovers condition-invariant latent structures, significantly outperforming both standard PCA and PCA+, a baseline relying solely on alignment.
📝 Abstract
High-dimensional data often contain low-dimensional signals obscured by structured background noise, which limits the effectiveness of standard PCA. Motivated by contrastive learning, we address the problem of recovering shared signal subspaces from positive pairs: paired observations that share the same signal but differ in background. Our baseline, PCA+, uses alignment-only contrastive learning and succeeds when background variation is mild, but fails under strong noise or in high-dimensional regimes. To address this, we introduce PCA++, a hard uniformity-constrained contrastive PCA that enforces identity covariance on projected features. PCA++ has a closed-form solution via a generalized eigenproblem, remains stable in high dimensions, and provably regularizes against background interference. We provide exact high-dimensional asymptotics in both fixed-aspect-ratio and growing-spike regimes, showing uniformity's role in robust signal recovery. Empirically, PCA++ outperforms standard PCA and alignment-only PCA+ on simulations, corrupted MNIST, and single-cell transcriptomics, reliably recovering condition-invariant structure. More broadly, we clarify uniformity's role in contrastive learning, showing that explicit feature dispersion defends against structured noise and enhances robustness.
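The abstract describes PCA++ as maximizing an alignment objective over positive pairs subject to a hard uniformity (identity-covariance) constraint, solved in closed form via a generalized eigenproblem. A minimal sketch of that recipe is below; the specific matrix choices (a symmetrized cross-view covariance for alignment, a pooled covariance for the uniformity constraint) and the function name `pca_pp` are our assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh


def pca_pp(X1, X2, k, ridge=1e-8):
    """Sketch of a uniformity-constrained contrastive PCA.

    X1, X2 : (n, d) paired views sharing a signal, differing in background.
    Returns a (d, k) projection whose columns are B-orthonormal,
    i.e. V.T @ B @ V = I (the hard uniformity constraint).
    """
    n, d = X1.shape
    # Alignment term: symmetrized cross-covariance between the two views.
    A = (X1.T @ X2 + X2.T @ X1) / (2 * n)
    # Uniformity constraint matrix: pooled covariance of both views
    # (small ridge keeps it positive definite).
    Xc = np.vstack([X1, X2])
    B = Xc.T @ Xc / (2 * n) + ridge * np.eye(d)
    # Closed form: generalized eigenproblem A v = lambda B v.
    # eigh returns eigenvalues ascending, so take the last k columns.
    w, V = eigh(A, B)
    return V[:, ::-1][:, :k]
```

Because `scipy.linalg.eigh(A, B)` returns B-orthonormal eigenvectors, the projected features automatically satisfy the identity-covariance (uniformity) constraint, which is what distinguishes this from an alignment-only baseline like PCA+.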