🤖 AI Summary
This paper addresses the robust geometric median problem in Euclidean space, focusing on coreset construction that eliminates dependence on the number $ m $ of outliers—enabling more compact and adaptive data compression. The proposed method introduces a novel non-componentwise error analysis framework, the first to achieve a coreset size independent of $ m $; in one dimension, it attains the theoretical optimum and uncovers fundamental distinctions from the classical median problem. Technically, the approach integrates robust statistical analysis, geometry-aware sensitivity sampling, and $ varepsilon $-approximation theory, supporting natural extensions to diverse metric spaces. Experiments demonstrate that the algorithm achieves optimal trade-offs among accuracy, compression ratio, and runtime efficiency, while maintaining strong robustness even under adversarial outlier assumptions.
📝 Abstract
We study the robust geometric median problem in Euclidean space $mathbb{R}^d$, with a focus on coreset construction.A coreset is a compact summary of a dataset $P$ of size $n$ that approximates the robust cost for all centers $c$ within a multiplicative error $varepsilon$. Given an outlier count $m$, we construct a coreset of size $ ilde{O}(varepsilon^{-2} cdot min{varepsilon^{-2}, d})$ when $n geq 4m$, eliminating the $O(m)$ dependency present in prior work [Huang et al., 2022&2023]. For the special case of $d = 1$, we achieve an optimal coreset size of $ ilde{Theta}(varepsilon^{-1/2} + frac{m}{n} varepsilon^{-1})$, revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust $(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, enabling substantial reduction of outlier influence, unlike prior methods that retain them.Empirically, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when data assumptions are violated across a wide range of datasets.