🤖 AI Summary
In federated learning settings with high-dimensional data (d) and massive samples (n), existing PCA methods struggle to simultaneously ensure privacy preservation and computational efficiency.
Method: We propose FADI, a federated PCA framework that parallelizes computation along the dimension $d$ and distributes it along the sample size $n$. It combines $L$ parallel copies of $p$-dimensional fast randomized sketches, distributed aggregation across the split samples, and non-asymptotic error analysis.
Contribution/Results: We show theoretically that FADI attains the same non-asymptotic error rate as traditional PCA when $Lp \ge d$. We derive the asymptotic distribution of the FADI estimator and show that it undergoes a phase transition as the product $Lp$ (the number of sketches times the sketch dimension) increases. Experiments on synthetic data and the 1000 Genomes Project demonstrate superior performance: reconstruction error reduced by 32%–58%, speedup up to 12.7×, and successful resolution of fine-scale population genetic structure, all while preserving data privacy.
📝 Abstract
Principal component analysis (PCA) is one of the most popular methods for dimension reduction. In light of the rapidly growing large-scale data in federated ecosystems, the traditional PCA method is often not applicable due to privacy protection considerations and its heavy computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension $d$ and the sample size $n$ are ultra-large, by simultaneously performing parallel computing along $d$ and distributed computing along $n$. Specifically, we utilize $L$ parallel copies of $p$-dimensional fast sketches to divide the computing burden along $d$ and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI enjoys the same non-asymptotic error rate as the traditional PCA when $Lp \ge d$. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as $Lp$ increases. We perform extensive simulations to show that FADI substantially outperforms the existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.
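To make the sketch-and-aggregate idea in the abstract concrete, the following is a minimal NumPy illustration, not the authors' implementation. It assumes Gaussian test matrices as the fast sketches, data that are already mean-centered, and aggregation by taking the leading directions of the averaged projection matrices; the function name `fadi_style_pca` and all parameter choices are hypothetical.

```python
import numpy as np


def fadi_style_pca(site_data, K, p, L, seed=0):
    """Illustrative sketch-and-aggregate PCA (assumed details, see lead-in).

    site_data: list of (n_i, d) arrays, one per site, assumed mean-centered.
    K: number of principal components; p: sketch dimension; L: number of sketches.
    """
    rng = np.random.default_rng(seed)
    d = site_data[0].shape[1]
    n = sum(X.shape[0] for X in site_data)
    bases = []
    for _ in range(L):  # the L sketches are independent and could run in parallel
        omega = rng.standard_normal((d, p))  # assumed Gaussian test matrix
        # Each site only shares a d x p product, never raw rows or a d x d matrix.
        Y = sum(X.T @ (X @ omega) for X in site_data) / n
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        bases.append(U[:, :K])  # top-K left singular vectors of this sketch
    # The top-K left singular vectors of the stacked bases equal the top-K
    # eigenvectors of the average of the L projection matrices.
    stacked = np.hstack(bases) / np.sqrt(L)
    U_agg, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U_agg[:, :K]


# Toy data: rank-3 signal plus small noise, split across 3 sites.
rng = np.random.default_rng(1)
d, K, n = 50, 3, 600
B = np.linalg.qr(rng.standard_normal((d, K)))[0] * np.array([5.0, 4.0, 3.0])
X = rng.standard_normal((n, K)) @ B.T + 0.01 * rng.standard_normal((n, d))
sites = np.array_split(X, 3)

V_hat = fadi_style_pca(sites, K=K, p=10, L=5)  # here L * p = 50 = d

# Reference: top-K eigenvectors of the pooled sample covariance.
V_ref = np.linalg.eigh(X.T @ X / n)[1][:, -K:]
err = np.linalg.norm(V_hat @ V_hat.T - V_ref @ V_ref.T)
print(f"subspace distance to pooled PCA: {err:.2e}")
```

The comparison uses the distance between projection matrices because principal directions are only identified up to sign and rotation within the subspace. The toy example has a large spectral gap, so it says nothing about the error rates or phase transition established in the paper.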