🤖 AI Summary
This work addresses efficient robust linear regression with high-dimensional Gaussian covariates whose covariance is unknown, under adversarial contamination at rate ε. The authors propose a near-linear-time algorithm that, when the contamination rate ε and the condition number κ of the covariance matrix satisfy εκ ≲ 1, achieves prediction error O(√(εκ)) using Õ(d/ε⁴) samples, improving over all prior works. They complement this with a statistical query (SQ) lower bound: in the same regime, efficient SQ algorithms achieving error o(√(εκ)) require queries that take Ω(d²) samples to simulate. Finally, a low-degree polynomial lower bound gives fine-grained evidence that, without assumptions such as εκ ≲ 1, efficient algorithms may require Ω̃(min{dε²κ², ε²d²}) samples to significantly outperform the trivial estimator that always guesses 0.
📝 Abstract
We revisit the problem of robust linear regression under Gaussian covariates with an unknown covariance matrix of condition number $κ$. For this fundamental problem, significant gaps remain in our understanding of the trade-offs among sample complexity, condition number, runtime, and prediction error for efficient algorithms. Our first result is a near-linear-time algorithm that uses $\widetilde{O}(d/ε^4)$ samples, where $d$ is the dimension and $ε$ is the corruption rate, and achieves prediction error $O(\sqrt{εκ})$ under the condition $εκ\lesssim 1$, improving over all prior works. We complement this result with a Statistical Query (SQ) lower bound showing that efficient SQ algorithms achieving error $o(\sqrt{εκ})$ when $εκ\lesssim 1$ require queries that take $Ω(d^2)$ samples to simulate. Finally, we prove a low-degree polynomial lower bound that gives fine-grained evidence that, without assumptions such as $εκ\lesssim 1$, efficient algorithms may require $\widetilde{Ω}\left(\min\{dε^{2}κ^{2},\ ε^{2}d^{2}\}\right)$ samples to significantly outperform the trivial estimator that always guesses $0$.
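To get a feel for how the rates quoted in the abstract compare, the following minimal sketch evaluates them for concrete values of $d$, $ε$, and $κ$. It is illustrative only: the function name `bound_scalings` and its dictionary keys are hypothetical, every hidden constant and logarithmic factor is set to 1 (an assumption, since the abstract states only asymptotic rates), and the check `eps * kappa <= 1` is a crude stand-in for the condition $εκ\lesssim 1$.

```python
import math


def bound_scalings(d: int, eps: float, kappa: float) -> dict:
    """Evaluate the abstract's asymptotic rates with constants and logs set to 1.

    The returned numbers show scaling only; they are not actual sample sizes.
    """
    if d <= 0 or not (0.0 < eps < 1.0) or kappa < 1.0:
        raise ValueError("need d > 0, 0 < eps < 1, and kappa >= 1")
    return {
        # Crude stand-in for the regime eps * kappa <~ 1 (constant assumed to be 1).
        "in_small_eps_kappa_regime": eps * kappa <= 1.0,
        # Samples used by the near-linear-time algorithm: d / eps^4.
        "algorithm_samples": d / eps**4,
        # Prediction error achieved in that regime: sqrt(eps * kappa).
        "prediction_error": math.sqrt(eps * kappa),
        # SQ lower bound: samples needed to simulate queries beating that error.
        "sq_lower_bound_samples": float(d) ** 2,
        # Low-degree lower bound: min{d eps^2 kappa^2, eps^2 d^2}.
        "low_degree_samples": min(d * eps**2 * kappa**2, eps**2 * d**2),
    }


if __name__ == "__main__":
    for name, value in bound_scalings(d=10_000, eps=0.2, kappa=4.0).items():
        print(f"{name}: {value}")
```

In the low-degree bound, the first term $dε^{2}κ^{2}$ is the smaller one exactly when $κ^{2}\le d$, so in this simplified view the minimum switches to $ε^{2}d^{2}$ once $κ$ exceeds $\sqrt{d}$.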