🤖 AI Summary
To improve low-rank matrix estimation in a target population when data are heterogeneous, this paper proposes a transfer learning framework based on latent subspace alignment. Leveraging similarities between the source and target populations in their latent row and column subspaces, the method fits a low-rank approximation of the target data regularized by a Procrustes distance penalty that explicitly accounts for subspace discrepancies. A cross-validation strategy lets the method adapt to the degree of heterogeneity across populations. This work is the first to explicitly embed subspace alignment into transferable low-rank estimation, circumventing strong distributional assumptions. Implemented in the R package learner, the method often improves estimation accuracy over the baseline that uses target data only, particularly when the source population has a high signal-to-noise ratio, as shown in extensive simulations and a re-analysis of genome-wide association study (GWAS) data from the BioBank Japan cohort.
📝 Abstract
Low-rank matrix estimation is a fundamental problem in statistics and machine learning. In the context of heterogeneous data generated from diverse sources, a key challenge lies in leveraging data from a source population to enhance the estimation of a low-rank matrix in a target population of interest. One such example is estimating associations between genetic variants and diseases in non-European ancestry groups. We propose an approach, which we refer to as LatEnt spAce-based tRaNsfer lEaRning (LEARNER), that leverages similarity in the latent row and column spaces of the source and target populations to improve estimation in the target population. LEARNER performs a low-rank approximation of the target population data while penalizing differences between the latent row and column spaces of the source and target populations. We present a cross-validation approach that allows the method to adapt to the degree of heterogeneity across populations. In extensive simulations, LEARNER often outperforms the benchmark approach that uses only the target population data, especially as the signal-to-noise ratio in the source population increases. We also provide an illustrative application and an empirical comparison of LEARNER with benchmark approaches in a re-analysis of a genome-wide association study in the BioBank Japan cohort. LEARNER is implemented in the R package learner.
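To make the penalized low-rank approximation concrete, one way such an objective could be written is sketched below. The abstract does not state the objective, so every symbol here is an assumption introduced for illustration: $Y_0 \in \mathbb{R}^{p \times q}$ is the observed target matrix, $\hat{U}_1 \in \mathbb{R}^{p \times r}$ and $\hat{V}_1 \in \mathbb{R}^{q \times r}$ hold the top-$r$ left and right singular vectors of the observed source matrix, and $\lambda_1, \lambda_2 \ge 0$ are tuning parameters. This is a sketch of the general idea, not necessarily the exact formulation used in the paper or in the learner package.

```latex
% Illustrative objective only; all notation is assumed, not taken from the abstract.
% P_perp(A) projects onto the orthogonal complement of the column space of A
% (A is assumed to have orthonormal columns).
\[
  P_{\perp}(A) = I - A A^{\top},
\]
\[
  (\hat{U}, \hat{V}) \in
  \operatorname*{arg\,min}_{U \in \mathbb{R}^{p \times r},\; V \in \mathbb{R}^{q \times r}}
  \Big\{
    \| U V^{\top} - Y_0 \|_F^2                        % fit to the target data
    + \lambda_1 \| P_{\perp}(\hat{U}_1)\, U \|_F^2    % row-space mismatch with the source
    + \lambda_1 \| P_{\perp}(\hat{V}_1)\, V \|_F^2    % column-space mismatch with the source
    + \lambda_2 \| U^{\top} U - V^{\top} V \|_F^2     % balances the two factors
  \Big\},
  \qquad
  \hat{\Theta}_0 = \hat{U} \hat{V}^{\top}.
\]
```

Under these assumptions, setting $\lambda_1 = 0$ gives the rank-$r$ truncated SVD of $Y_0$, which is the target-only benchmark. Larger values of $\lambda_1$ pull the estimated row and column spaces toward those of the source population. Choosing $\lambda_1$ by cross-validation on the target data is one way the method could adapt to how similar the two populations actually are.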