🤖 AI Summary
This paper addresses robust estimation of high-dimensional covariance matrices in the low contamination regime (ε = o(1/√d)): given a sample within Hamming distance εn of an i.i.d. sample, the goal is to efficiently compute an estimate with relative operator-norm error at most 1/2. We propose the first algorithm achieving near-linear sample complexity Õ(d) for both sub-Gaussian and heavy-tailed distributions (the latter in a narrower range of ε), with runtime O((n + d)^(ω + 1/2)), where ω < 2.373 is the matrix multiplication exponent. This improves significantly over prior methods, which require either Ω(d^(3/2)) samples or time exponential in d. Our core innovation is a novel robust estimation framework built on fast matrix multiplication, bypassing computationally expensive algebraic tools such as Sum-of-Squares. The framework retains near-optimal statistical and computational efficiency while demonstrating practical efficacy in empirical evaluation.
📝 Abstract
Let $\mathcal{Z} = \{Z_1, \dots, Z_n\} \stackrel{\mathrm{i.i.d.}}{\sim} P$ be drawn from a distribution $P$ over $\mathbb{R}^d$ with mean zero and covariance $\Sigma$. Given a dataset $\mathcal{X}$ such that $d_{\mathrm{ham}}(\mathcal{X}, \mathcal{Z}) \leq \varepsilon n$, we are interested in finding an efficient estimator $\widehat{\Sigma}$ that achieves $\mathrm{err}(\widehat{\Sigma}, \Sigma) := \|\Sigma^{-\frac{1}{2}}\widehat{\Sigma}\Sigma^{-\frac{1}{2}} - I\|_{\mathrm{op}} \leq 1/2$. We focus on the low contamination regime $\varepsilon = o(1/\sqrt{d})$. In this regime, prior work required either $\Omega(d^{3/2})$ samples or runtime exponential in $d$. We present an algorithm that, for sub-Gaussian data, has near-linear sample complexity $n = \widetilde{\Omega}(d)$ and runtime $O((n+d)^{\omega + \frac{1}{2}})$, where $\omega$ is the matrix multiplication exponent. We also show that this algorithm works for heavy-tailed data with near-linear sample complexity, but in a smaller regime of $\varepsilon$. Concurrent to our work, Diakonikolas et al. [2024] give Sum-of-Squares estimators that achieve similar sample complexity but with large polynomial runtime.
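The error metric above measures how close $\widehat{\Sigma}$ is to $\Sigma$ after normalizing by $\Sigma^{-1/2}$ on both sides, so that $\mathrm{err} \leq 1/2$ means $\widehat{\Sigma}$ approximates $\Sigma$ up to a constant factor in every direction. A minimal numpy sketch of this metric (function names here are illustrative, not from the paper):

```python
import numpy as np

def relative_op_error(sigma_hat: np.ndarray, sigma: np.ndarray) -> float:
    """err(Sigma_hat, Sigma) = ||Sigma^{-1/2} Sigma_hat Sigma^{-1/2} - I||_op."""
    # Symmetric inverse square root of Sigma via eigendecomposition.
    vals, vecs = np.linalg.eigh(sigma)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    m = inv_sqrt @ sigma_hat @ inv_sqrt - np.eye(sigma.shape[0])
    return float(np.linalg.norm(m, ord=2))  # operator norm = largest singular value

# Toy check on uncontaminated mean-zero Gaussian data: the empirical
# covariance has small relative error once n is large compared to d.
rng = np.random.default_rng(0)
d, n = 5, 20000
sigma = np.diag(np.arange(1.0, d + 1))
z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
sigma_hat = z.T @ z / n  # empirical covariance (mean is known to be zero)
print(relative_op_error(sigma_hat, sigma))  # small for large n
```

The normalization by $\Sigma^{-1/2}$ makes the guarantee scale-invariant: replacing both matrices by $c\Sigma$ and $c\widehat{\Sigma}$ leaves the error unchanged.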