$k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the *k-subspace median* problem: finding a *k*-dimensional affine subspace in ℝᵈ that minimizes the sum of ℓ₂ (non-squared) Euclidean distances from data points to the subspace. Unlike classical *k*-PCA—which minimizes the ℓ₂,₂ norm—this problem is significantly harder due to the non-convexity of the ℓ₂,₁ mixed norm, especially when *k* < *d*−1. We present the first deterministic polynomial-time algorithm whose runtime and approximation ratio scale polynomially—not exponentially—in *k*, thereby overcoming a long-standing bottleneck in non-convex mixed-norm optimization. Our approach integrates geometric sampling, subspace approximation, and a unified ℓ₂,*z* norm optimization framework (*z* = 1, extendable to *z* ≠ 1,2). Theoretically, we achieve a √*d* multiplicative approximation guarantee. Empirical evaluation on real-world datasets confirms effectiveness, and our open-source implementation substantially enhances the computational tractability and practical applicability of robust principal component analysis.
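The contrast between the two objectives can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's algorithm: the helper name `subspace_costs` and the toy data are assumptions, and the candidate subspace here is simply the classical SVD solution.

```python
import numpy as np

def subspace_costs(P, c, V):
    """Sum of squared (k-PCA, l_{2,2}) and non-squared (k-subspace median,
    l_{2,1}) Euclidean distances from the rows of P to the affine subspace
    through c whose linear part is spanned by the orthonormal rows of V."""
    R = P - c                             # translate so the subspace passes through the origin
    residual = R - (R @ V.T) @ V          # component orthogonal to the subspace
    d = np.linalg.norm(residual, axis=1)  # Euclidean distance of each point
    return (d ** 2).sum(), d.sum()        # l_{2,2} cost, l_{2,1} cost

# Toy data: three near-collinear inliers plus one outlier (made up for illustration).
P = np.array([[0., 0.], [1., 0.1], [2., -0.1], [10., 8.]])
c = P.mean(axis=0)
_, _, Vt = np.linalg.svd(P - c)
V = Vt[:1]                                # top principal direction (k = 1)
l22, l21 = subspace_costs(P, c, V)
```

Classical k-PCA minimizes the first returned cost in closed form via the SVD; the paper's contribution is approximating a minimizer of the second, for which no closed form exists.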

📝 Abstract
Given an integer $k\geq 1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA (Principal Component Analysis) approximates the affine \emph{$k$-subspace mean} of $P$, which is the $k$-dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ($\ell_{2,2}$-norm) over the points of $P$, i.e., the mean of these distances. The \emph{$k$-subspace median} is the subspace that minimizes its sum of (non-squared) Euclidean distances ($\ell_{2,1}$ mixed norm), i.e., their median. The median subspace is usually more sparse and robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k<d-1$. We provide the first polynomial-time deterministic algorithm whose running time and approximation factor are both not exponential in $k$. More precisely, the multiplicative approximation factor is $\sqrt{d}$, and the running time is polynomial in the size of the input. We expect that our technique will be useful for many other related problems, such as the $\ell_{2,z}$ norm of distances for $z \notin \{1,2\}$, e.g., $z=\infty$, and handling outliers/sparsity. Open code and experimental results on real-world datasets are also provided.
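In the abstract's notation, the two objectives can be restated side by side (with $\mathcal{S}_k$ denoting the set of affine $k$-dimensional subspaces of $\mathbb{R}^d$, a symbol introduced here for clarity):

```latex
\operatorname*{argmin}_{S \in \mathcal{S}_k} \sum_{p \in P} \operatorname{dist}(p, S)^2
  \quad (k\text{-subspace mean, } \ell_{2,2})
\qquad
\operatorname*{argmin}_{S \in \mathcal{S}_k} \sum_{p \in P} \operatorname{dist}(p, S)
  \quad (k\text{-subspace median, } \ell_{2,1})
```

The left problem is solved exactly by the SVD; the right one is the non-convex problem this paper approximates.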
Problem

Research questions and friction points this paper is trying to address.

Approximates k-subspace median for non-squared Euclidean distances
Provides polynomial-time deterministic algorithm with √d approximation
Handles robustness to noise and outliers in PCA
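The robustness point above can be illustrated with a toy example (made up, not from the paper): with one sufficiently large outlier, the squared ℓ₂,₂ objective can prefer the outlier's direction, while the non-squared ℓ₂,₁ objective still prefers the line through the inliers.

```python
import numpy as np

def costs(P, v):
    """l_{2,2} and l_{2,1} costs of the 1-subspace span{v} (v a unit
    vector, subspace through the origin)."""
    residual = P - np.outer(P @ v, v)      # component orthogonal to span{v}
    d = np.linalg.norm(residual, axis=1)   # Euclidean distance per point
    return (d ** 2).sum(), d.sum()

# Three inliers on the x-axis plus one outlier on the y-axis.
P = np.array([[1., 0.], [2., 0.], [3., 0.], [0., 5.]])
e1, e2 = np.array([1., 0.]), np.array([0., 1.])

l22_x, l21_x = costs(P, e1)   # x-axis: (25.0, 5.0)
l22_y, l21_y = costs(P, e2)   # y-axis: (14.0, 6.0)
# The squared cost favors the outlier's axis (14 < 25), while the
# median-style cost favors the inliers' axis (5 < 6).
```

This is exactly the sensitivity that makes the median subspace more robust, and the non-convexity that makes it harder to optimize.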
Innovation

Methods, ideas, or system contributions that make the work stand out.

Polynomial-time deterministic algorithm for the k-subspace median (robust k-PCA)
Approximation factor of √d for the subspace median
Handles non-squared Euclidean distances efficiently