🤖 AI Summary
This paper studies the *k-subspace median* problem: finding a *k*-dimensional affine subspace in ℝᵈ that minimizes the sum of ℓ₂ (non-squared) Euclidean distances from data points to the subspace. Unlike classical *k*-PCA—which minimizes the ℓ₂,₂ norm—this problem is significantly harder due to the non-convexity of the ℓ₂,₁ mixed norm, especially when *k* < *d*−1. We present the first deterministic polynomial-time algorithm whose runtime and approximation ratio scale polynomially—not exponentially—in *k*, thereby overcoming a long-standing bottleneck in non-convex mixed-norm optimization. Our approach integrates geometric sampling, subspace approximation, and a unified ℓ₂,*z* norm optimization framework (*z* = 1, extendable to *z* ≠ 1,2). Theoretically, we achieve a √*d* multiplicative approximation guarantee. Empirical evaluation on real-world datasets confirms effectiveness, and our open-source implementation substantially enhances the computational tractability and practical applicability of robust principal component analysis.
📝 Abstract
Given an integer $k\geq1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA (Principal Component Analysis) approximates the affine \emph{$k$-subspace mean} of $P$, which is the $k$-dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ($\ell_{2,2}$-norm) over the points of $P$, i.e., the mean of these distances. The \emph{$k$-subspace median} is the subspace that minimizes its sum of (non-squared) Euclidean distances ($\ell_{2,1}$-mixed norm), i.e., their median. The median subspace is usually more sparse and robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k<d-1$.
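To make the two objectives concrete, the following minimal sketch evaluates both costs for one candidate subspace. It is an illustration only and not the algorithm of this paper: the helper names `subspace_distances` and `kpca_subspace` and the synthetic data are assumptions introduced here, and the candidate subspace is simply the standard $k$-PCA solution obtained from an SVD of the centered points.

```python
import numpy as np


def subspace_distances(P, center, V):
    """Euclidean distance from each row of P to the affine subspace
    {center + V @ c}. V is (d, k) and is assumed to have orthonormal columns."""
    X = P - center
    residual = X - (X @ V) @ V.T  # component orthogonal to span(V)
    return np.linalg.norm(residual, axis=1)


def kpca_subspace(P, k):
    """Hypothetical helper: classic k-PCA, i.e. the affine k-subspace that
    minimizes the sum of SQUARED distances (the k-subspace mean)."""
    center = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - center, full_matrices=False)
    return center, Vt[:k].T  # top-k right singular vectors as columns


# Synthetic data (assumption): points near a 2-D subspace of R^5, plus outliers.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.3, 0.3, 0.3])
P[:5] += 100.0  # a few gross outliers

center, V = kpca_subspace(P, k=2)
dist = subspace_distances(P, center, V)

mean_cost = np.sum(dist ** 2)  # l_{2,2} objective, minimized by k-PCA
median_cost = np.sum(dist)     # l_{2,1} objective of the k-subspace median
print(f"sum of squared distances: {mean_cost:.2f}")
print(f"sum of distances:         {median_cost:.2f}")
```

Squaring amplifies the contribution of the outliers to the first cost, which is one way to see why the non-squared objective is regarded as more robust. The sketch only evaluates the $\ell_{2,1}$ cost of a given subspace; it does not attempt to minimize it.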
We provide the first polynomial-time deterministic algorithm whose running time and approximation factor are both not exponential in $k$. More precisely, the multiplicative approximation factor is $\sqrt{d}$, and the running time is polynomial in the size of the input. We expect that our technique would be useful for many other related problems, such as the $\ell_{2,z}$ norm of distances for $z\notin\{1,2\}$, e.g., $z=\infty$, and handling outliers/sparsity.
Open code and experimental results on real-world datasets are also provided.