A fast and slightly robust covariance estimator

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses robust estimation of high-dimensional covariance matrices in the low contamination regime (ε = o(1/√d)): given a sample within Hamming distance εn of clean i.i.d. data, the goal is to efficiently compute an estimate with relative operator-norm error ≤ 1/2. The authors give the first algorithm achieving near-linear sample complexity Õ(d) for both sub-Gaussian and heavy-tailed distributions, with runtime O((n + d)^{ω + 1/2}), where ω < 2.373 is the matrix multiplication exponent. This improves significantly over prior methods, which required either Ω(d^{3/2}) samples or runtime exponential in d. The core innovation is a robust estimation framework built on fast matrix multiplication, bypassing computationally expensive algebraic machinery such as Sum-of-Squares while retaining near-optimal statistical and computational guarantees.

📝 Abstract
Let $\mathcal{Z} = \{Z_1, \dots, Z_n\} \stackrel{\mathrm{i.i.d.}}{\sim} P \subset \mathbb{R}^d$ from a distribution $P$ with mean zero and covariance $\Sigma$. Given a dataset $\mathcal{X}$ such that $d_{\mathrm{ham}}(\mathcal{X}, \mathcal{Z}) \leq \varepsilon n$, we are interested in finding an efficient estimator $\widehat{\Sigma}$ that achieves $\mathrm{err}(\widehat{\Sigma}, \Sigma) := \|\Sigma^{-\frac{1}{2}}\widehat{\Sigma}\Sigma^{-\frac{1}{2}} - I\|_{\mathrm{op}} \leq 1/2$. We focus on the low contamination regime $\varepsilon = o(1/\sqrt{d})$. In this regime, prior work required either $\Omega(d^{3/2})$ samples or runtime that is exponential in $d$. We present an algorithm that, for subgaussian data, has near-linear sample complexity $n = \widetilde{\Omega}(d)$ and runtime $O((n+d)^{\omega + \frac{1}{2}})$, where $\omega$ is the matrix multiplication exponent. We also show that this algorithm works for heavy-tailed data with near-linear sample complexity, but in a smaller regime of $\varepsilon$. Concurrent to our work, Diakonikolas et al. [2024] give Sum-of-Squares estimators that achieve similar sample complexity but with large polynomial runtime.
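To make the problem setup concrete, the sketch below (not the paper's algorithm) evaluates the abstract's error metric $\mathrm{err}(\widehat{\Sigma}, \Sigma) = \|\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2} - I\|_{\mathrm{op}}$ for the naive empirical covariance, on a toy ε-contaminated Gaussian sample; the specific dimensions, corruption pattern, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5000
eps = 0.01  # contamination rate; the paper's regime is eps = o(1/sqrt(d))

Sigma = np.diag(np.linspace(1.0, 4.0, d))           # true covariance
Z = rng.multivariate_normal(np.zeros(d), Sigma, n)  # clean i.i.d. sample Z

# The adversary may replace up to eps*n points, i.e. d_ham(X, Z) <= eps*n.
# Here we plant large outliers in the first k rows as a toy corruption.
X = Z.copy()
k = int(eps * n)
X[:k] = 50.0 * rng.standard_normal((k, d))

def op_norm_err(Sigma_hat, Sigma):
    """err(Sigma_hat, Sigma) = ||Sigma^{-1/2} Sigma_hat Sigma^{-1/2} - I||_op.

    Uses the Cholesky factor L (Sigma = L L^T) in place of the symmetric
    square root; the two differ by an orthogonal factor, which leaves the
    operator norm unchanged.
    """
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
    M = L_inv @ Sigma_hat @ L_inv.T
    return np.linalg.norm(M - np.eye(Sigma.shape[0]), ord=2)

emp_clean = Z.T @ Z / n  # empirical covariance of the clean sample
emp_dirty = X.T @ X / n  # empirical covariance of the contaminated sample

print(op_norm_err(emp_clean, Sigma))  # small: roughly sqrt(d/n) here
print(op_norm_err(emp_dirty, Sigma))  # inflated by the planted outliers
```

Even a 1% contamination rate blows up the naive estimate, which is why a robust estimator is needed to reach error ≤ 1/2 with near-linear n.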
Problem

Research questions and friction points this paper is trying to address.

Robustly estimating a high-dimensional covariance matrix from an ε-contaminated sample with near-linear sample complexity.
Handling both subgaussian and heavy-tailed data while keeping the runtime polynomial and practical.
Improving on prior approaches, which needed either Ω(d^{3/2}) samples or runtime exponential in d.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Near-linear sample complexity n = Ω̃(d) for subgaussian data
Runtime O((n + d)^{ω + 1/2}) via fast matrix multiplication, avoiding Sum-of-Squares machinery
Extends to heavy-tailed data, in a smaller regime of ε