Confidence Intervals for Linear Models with Arbitrary Noise Contamination

📅 2025-11-10
🤖 AI Summary
This paper addresses the problem of constructing confidence intervals for linear regression coefficients under the Huber contamination model, where an unknown fraction ε of arbitrary outliers invalidates conventional inference. The authors propose a Z-estimation-based algorithm with two ingredients: a preprocessing step that decorrelates the covariates associated with the nuisance parameters, and a smooth estimating function whose uncertainty is quantified directly, so that no prior knowledge of ε is needed. The resulting confidence intervals attain (1−α) coverage uniformly over all contamination distributions and achieve the optimal length O(1/√[n(1−ε)²]), the same rate attainable when ε is known. The key contribution is the combination of the Z-estimation framework, a smooth estimating function, and covariate decorrelation, which together give valid and rate-optimal inference under unknown contamination.

📝 Abstract
We study confidence interval construction for linear regression under Huber's contamination model, where an unknown fraction of noise variables is arbitrarily corrupted. While robust point estimation in this setting is well understood, statistical inference remains challenging, especially because the contamination proportion is not identifiable from the data. We develop a new algorithm that constructs confidence intervals for individual regression coefficients without any prior knowledge of the contamination level. Our method is based on a Z-estimation framework using a smooth estimating function. The method directly quantifies the uncertainty of the estimating equation after a preprocessing step that decorrelates covariates associated with the nuisance parameters. We show that the resulting confidence interval has valid coverage uniformly over all contamination distributions and attains an optimal length of order $O(1/\sqrt{n(1-\epsilon)^2})$, matching the rate achievable when the contamination proportion $\epsilon$ is known. This result stands in sharp contrast to the adaptation cost of robust interval estimation observed in the simpler Gaussian location model.
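
The two ingredients named in the abstract, covariate decorrelation and a smooth estimating function, can be illustrated with a minimal sketch. This is not the paper's algorithm and it does not construct a confidence interval. It rests on assumptions that do not come from the source: decorrelation is taken to mean least-squares residualization of one covariate on the others, the estimating function is the Welsch-type ψ(r) = r·exp(−r²/(2h²)), the contamination is a fixed shift, and the function names `simulate_contaminated`, `decorrelate`, and `solve_z_estimate` are hypothetical.

```python
import numpy as np


def simulate_contaminated(n, beta, eps, rng, outlier_shift=20.0):
    """Draw (X, y) with a fraction of about eps of the noise terms corrupted.

    The fixed shift is just one example of an 'arbitrary' contamination;
    it is an illustrative assumption, not taken from the paper.
    """
    X = rng.standard_normal((n, len(beta)))
    noise = rng.standard_normal(n)
    corrupt = rng.random(n) < eps
    noise[corrupt] = outlier_shift
    return X, X @ beta + noise


def decorrelate(X, j):
    """Residualize column j on the remaining columns (assumed meaning of
    'decorrelating covariates associated with the nuisance parameters')."""
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    return X[:, j] - others @ coef


def solve_z_estimate(X, y, h=2.0, n_iter=50):
    """Solve sum_i x_i * psi(y_i - x_i^T beta) = 0 with the smooth function
    psi(r) = r * exp(-r^2 / (2 h^2)), by iteratively reweighted least squares.

    At a fixed point, X^T W (y - X beta) = 0 with W = diag(exp(-r^2/(2h^2))),
    which is exactly the estimating equation above.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        weights = np.exp(-0.5 * (r / h) ** 2)
        Xw = X * weights[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_beta = np.array([1.0, -2.0, 0.5])
    X, y = simulate_contaminated(2000, true_beta, eps=0.1, rng=rng)
    w = decorrelate(X, 0)
    print("inner products with nuisance covariates:", np.delete(X, 0, axis=1).T @ w)
    print("Z-estimate:", solve_z_estimate(X, y))
```

The redescending ψ gives grossly corrupted residuals almost no weight, which is why a bounded smooth function is a natural choice here; the bandwidth `h` is a tuning parameter chosen arbitrarily for the sketch.
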
Problem

Research questions and friction points this paper is trying to address.

Constructing confidence intervals for linear regression with arbitrary noise contamination
Developing inference methods without prior knowledge of contamination proportion
Achieving the optimal interval length, matching the rate attainable when the contamination proportion is known
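
As a small worked simplification of the length rate stated in the abstract (using only the symbols $n$ and $\epsilon$ from the source; the value $\epsilon = 0.2$ is an arbitrary illustration), for $0 \le \epsilon < 1$:

$$\frac{1}{\sqrt{n(1-\epsilon)^2}} = \frac{1}{(1-\epsilon)\sqrt{n}}$$

So the rate reduces to the usual $1/\sqrt{n}$ at $\epsilon = 0$, and at $\epsilon = 0.2$ it is larger by a factor of $1/(1-0.2) = 1.25$.
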
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confidence intervals without contamination level knowledge
Z-estimation with a smooth estimating function, applied after decorrelating covariates
Uniform coverage and optimal length over contamination distributions
Dong Xie
University of Chicago
Chao Gao
University of Chicago
John Lafferty
Yale University
Machine Learning