🤖 AI Summary
This paper addresses robust estimation of high-dimensional covariance matrices in the low contamination regime (ε = o(1/√d)): given a sample within Hamming distance εn of an i.i.d. sample, the goal is to efficiently compute an estimate with relative operator-norm error at most 1/2. We propose the first algorithm achieving near-linear sample complexity Õ(d) for both sub-Gaussian and heavy-tailed distributions (the latter in a narrower range of ε), with runtime O((n + d)^(ω + 1/2)), where ω < 2.373 is the matrix multiplication exponent. This improves significantly over prior methods, which require either Ω(d^(3/2)) samples or time exponential in d. Our core innovation is a novel robust estimation framework built on fast matrix multiplication, bypassing computationally expensive algebraic tools such as Sum-of-Squares. The framework retains near-optimal statistical and computational efficiency while demonstrating practical efficacy in empirical evaluation.
📝 Abstract
Let $\mathcal{Z} = \{Z_1, \dots, Z_n\} \stackrel{\mathrm{i.i.d.}}{\sim} P$ be drawn from a distribution $P$ over $\mathbb{R}^d$ with mean zero and covariance $\Sigma$. Given a dataset $\mathcal{X}$ such that $d_{\mathrm{ham}}(\mathcal{X}, \mathcal{Z}) \leq \varepsilon n$, we are interested in finding an efficient estimator $\widehat{\Sigma}$ that achieves $\mathrm{err}(\widehat{\Sigma}, \Sigma) := \|\Sigma^{-\frac{1}{2}}\widehat{\Sigma}\Sigma^{-\frac{1}{2}} - I\|_{\mathrm{op}} \leq 1/2$. We focus on the low contamination regime $\varepsilon = o(1/\sqrt{d})$. In this regime, prior work required either $\Omega(d^{3/2})$ samples or runtime exponential in $d$. We present an algorithm that, for sub-Gaussian data, has near-linear sample complexity $n = \widetilde{\Omega}(d)$ and runtime $O((n+d)^{\omega + \frac{1}{2}})$, where $\omega$ is the matrix multiplication exponent. We also show that this algorithm works for heavy-tailed data with near-linear sample complexity, but in a smaller regime of $\varepsilon$. Concurrent to our work, Diakonikolas et al. [2024] give Sum-of-Squares estimators that achieve similar sample complexity but with large polynomial runtime.
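The error metric above measures how close $\widehat{\Sigma}$ is to $\Sigma$ after normalizing by $\Sigma^{-1/2}$ on both sides, so that $\mathrm{err} \leq 1/2$ means $\widehat{\Sigma}$ approximates $\Sigma$ up to a constant factor in every direction. A minimal numpy sketch of this metric (function names here are illustrative, not from the paper):

```python
import numpy as np

def relative_op_error(sigma_hat: np.ndarray, sigma: np.ndarray) -> float:
    """err(Sigma_hat, Sigma) = ||Sigma^{-1/2} Sigma_hat Sigma^{-1/2} - I||_op."""
    # Symmetric inverse square root of Sigma via eigendecomposition.
    vals, vecs = np.linalg.eigh(sigma)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    m = inv_sqrt @ sigma_hat @ inv_sqrt - np.eye(sigma.shape[0])
    return float(np.linalg.norm(m, ord=2))  # operator norm = largest singular value

# Toy check on uncontaminated mean-zero Gaussian data: the empirical
# covariance has small relative error once n is large compared to d.
rng = np.random.default_rng(0)
d, n = 5, 20000
sigma = np.diag(np.arange(1.0, d + 1))
z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
sigma_hat = z.T @ z / n  # empirical covariance (mean is known to be zero)
print(relative_op_error(sigma_hat, sigma))  # small for large n
```

The normalization by $\Sigma^{-1/2}$ makes the guarantee scale-invariant: replacing both matrices by $c\Sigma$ and $c\widehat{\Sigma}$ leaves the error unchanged.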