Linear Regression under Missing or Corrupted Coordinates

📅 2025-09-23

🤖 AI Summary

This paper investigates the fundamental limits of estimation in multivariate linear regression with Gaussian covariates under two adversarial settings: missing data and corruption. In the missing-data setting, an adversary inspects the dataset and deletes entries in up to an η-fraction of samples per coordinate (non-randomly, with known locations); in the corruption setting, the adversary instead replaces values arbitrarily, and the locations are unknown to the learner. Unlike the clean setting, where estimation error vanishes as the sample size grows, the optimal error here remains a positive function of the problem parameters. The paper establishes information-theoretic lower bounds for both settings across essentially the entire parameter range and shows that computationally efficient algorithms match them up to constant factors. A key implication is that the optimal error in the missing-data setting matches that in the corruption setting, so knowing the contamination locations offers no general advantage.

📝 Abstract

We study multivariate linear regression under Gaussian covariates in two settings, where data may be erased or corrupted by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate, a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even information-theoretically. Unlike the clean setting, where estimation error vanishes with more samples, here the optimal error remains a positive function of the problem parameters. Our main contribution is to characterize this error up to constant factors across essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error that match the error of (computationally efficient) algorithms. A key implication is that, perhaps surprisingly, the optimal error in the missing data setting matches that in the corruption setting, so knowing the corruption locations offers no general advantage.
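To make the coordinate-wise budget concrete, the following is a minimal Python sketch of the incomplete data setting. Everything in it is an illustrative assumption rather than the paper's construction: the helper names `make_data` and `erase_largest`, the noise level, the choice to erase only covariate entries, and the toy adversary that deletes the largest-magnitude entries in each coordinate. It only shows that the adversary sees the data before choosing, and that an $\eta$ budget per coordinate can touch far more than an $\eta$-fraction of rows.

```python
import numpy as np

def make_data(n, d, noise_std=0.1, seed=0):
    """Gaussian covariates with a linear response (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d)
    y = X @ beta + noise_std * rng.standard_normal(n)
    return X, y, beta

def erase_largest(X, eta):
    """Toy adversary (an assumption, not the paper's): after inspecting X,
    erase the floor(eta * n) largest-magnitude entries of each coordinate."""
    n, d = X.shape
    budget = int(np.floor(eta * n))
    mask = np.zeros((n, d), dtype=bool)
    for j in range(d):
        if budget > 0:
            idx = np.argsort(-np.abs(X[:, j]))[:budget]
            mask[idx, j] = True
    X_obs = X.copy()
    X_obs[mask] = np.nan  # erased entries; their locations are known to the learner
    return X_obs, mask

X, y, beta = make_data(n=1000, d=5)
X_obs, mask = erase_largest(X, eta=0.1)

per_coord = mask.mean(axis=0)       # fraction erased in each coordinate
rows_hit = mask.any(axis=1).mean()  # fraction of rows with at least one erasure

print("erased fraction per coordinate:", per_coord)
print("fraction of rows with a missing entry:", rows_hit)
```

Because the budget applies to each coordinate separately, the fraction of rows with at least one missing entry can be as large as min(1, d·η), so simply discarding incomplete rows may throw away much of the dataset.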

Problem

Research questions and friction points this paper is trying to address.

Studies linear regression with adversarial data erasure or corruption

Characterizes optimal error bounds for missing and corrupted data settings

Shows knowing corruption locations offers no general advantage

Innovation

Methods, ideas, or system contributions that make the work stand out.

Characterizes optimal error bounds for adversarial missing data

Matches information-theoretic lower bounds with efficient algorithms

Shows corruption location knowledge provides no general advantage
