🤖 AI Summary
This paper studies the sublinear estimation of the mean of a point set in $d$-dimensional Euclidean space: given only a small number of random samples, compute a $(1+\varepsilon)$-approximation to the true mean—the minimizer of the sum of squared distances—with probability at least $1-\delta$. We establish, for the first time, the optimal sample complexity $\Theta(\varepsilon^{-1} \log \delta^{-1})$. Two sublinear-time algorithms are proposed: (1) an accelerated gradient descent method with time complexity $O((\varepsilon^{-1} + \log\log \delta^{-1}) \log \delta^{-1} \cdot d)$; and (2) a novel geometric median-of-means framework integrating order statistics and clustering, achieving $O((\varepsilon^{-1} + \log^\gamma \delta^{-1}) \log \delta^{-1} \cdot d)$ complexity. Our key innovation is the generalization of the classical median-of-means estimator to the *geometric* median-of-means, accompanied by a unified analysis of its robustness and convergence—substantially improving estimation efficiency and theoretical guarantees under high-dimensional, sparse sampling.
📝 Abstract
We study the sublinear multivariate mean estimation problem in $d$-dimensional Euclidean space. Specifically, we aim to find the mean $\mu$ of a ground point set $A$, which minimizes the sum of squared Euclidean distances of the points in $A$ to $\mu$. We first show that a multiplicative $(1+\varepsilon)$ approximation to $\mu$ can be found with probability $1-\delta$ using $O(\varepsilon^{-1}\log \delta^{-1})$ many independent uniform random samples, and provide a matching lower bound. Furthermore, we give two sublinear time algorithms with optimal sample complexity for extracting a suitable approximate mean:

1. A gradient descent approach running in time $O((\varepsilon^{-1}+\log\log \delta^{-1})\cdot \log \delta^{-1} \cdot d)$. It optimizes the geometric median objective while being significantly faster for our specific setting than all other known algorithms for this problem.
2. An order statistics and clustering approach running in time $O\left((\varepsilon^{-1}+\log^{\gamma}\delta^{-1})\cdot \log \delta^{-1} \cdot d\right)$ for any constant $\gamma>0$.

Throughout our analysis, we also generalize the familiar median-of-means estimator to the multivariate case, showing that the geometric median-of-means estimator achieves an optimal sample complexity for estimating $\mu$, which may be of independent interest.
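To make the geometric median-of-means idea concrete, here is a minimal sketch: draw samples, split them into groups, average each group, and return the geometric median of the group means. The abstract does not specify the geometric-median solver; this sketch uses Weiszfeld's iteration, a standard method for that subproblem (the paper's own algorithms, with their stated running times, differ). Function names and parameters here are illustrative, not the paper's.

```python
import numpy as np

def geometric_median(points, iters=200, tol=1e-9):
    """Approximate the geometric median (minimizer of the sum of
    Euclidean distances) via Weiszfeld's iteration. This is a
    standard solver, used here only for illustration."""
    y = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - y, axis=1)
        d = np.maximum(d, tol)  # avoid division by zero at a data point
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def geometric_median_of_means(samples, k):
    """Geometric median-of-means: split the samples into k groups,
    average each group, and take the geometric median of the
    resulting group means to gain robustness over a single mean."""
    groups = np.array_split(samples, k)
    group_means = np.stack([g.mean(axis=0) for g in groups])
    return geometric_median(group_means)
```

Usage: with $O(\varepsilon^{-1}\log\delta^{-1})$ uniform samples from the point set, splitting into $\Theta(\log\delta^{-1})$ groups boosts a constant per-group success probability to $1-\delta$, mirroring the univariate median-of-means argument.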