Increasing Missingness to Reduce Bias: Richardson-SGD with Missing Data

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the systematic gradient bias that imputation induces in stochastic gradient descent (SGD) when covariates are missing, a bias that degrades both optimization and estimation accuracy. The authors propose a debiased SGD method based on Richardson extrapolation: it deliberately introduces controlled additional missingness into the already incomplete data and combines stochastic gradients computed at different missingness levels to cancel the leading bias term. This "add missingness to reduce bias" approach is model-agnostic and computationally efficient. Theoretically, a single Richardson step reduces the gradient bias from O(‖p‖) to O(‖p‖²), where p denotes the vector of missingness ratios. Experiments show improved optimization and estimation across several generalized linear models, and the method is compatible with widely used imputation procedures such as MICE.
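
The cancellation behind the one-step claim can be sketched as follows. This is an illustrative derivation under assumptions that are not taken from the paper: the gradient bias $b(p)$ is assumed to split into a linear term $B_1 p$ plus a remainder $R(p) = O(\|p\|^2)$, and the second gradient is assumed to be computed at a uniformly scaled missingness level $\alpha p$ with $\alpha > 1$. The symbols $B_1$, $R$, $\alpha$, $g_p$ and $q_j$ are introduced here for illustration only, and the paper's actual weights and thinning scheme may differ.

```latex
% Assumed expansion of the imputed-gradient bias in the missingness vector p
\mathbb{E}[g_p] - \nabla L = b(p) = B_1 p + R(p), \qquad R(p) = O(\|p\|^2).

% One Richardson step: combine the gradients at levels p and \alpha p
g^{\mathrm{R}} = \frac{\alpha\, g_p - g_{\alpha p}}{\alpha - 1}.

% The linear terms cancel and only second-order terms remain
\mathbb{E}[g^{\mathrm{R}}] - \nabla L
  = \frac{\alpha B_1 p - B_1(\alpha p)}{\alpha - 1}
  + \frac{\alpha R(p) - R(\alpha p)}{\alpha - 1}
  = O(\|p\|^2).

% Under independent masking, extra thinning of coordinate j with probability q_j
% raises its missingness from p_j to \alpha p_j when
1 - (1 - p_j)(1 - q_j) = \alpha p_j
  \quad\Longleftrightarrow\quad
  q_j = \frac{(\alpha - 1)\, p_j}{1 - p_j}.
```

With $\alpha = 2$ the combination reduces to $2 g_p - g_{2p}$, the familiar form of Richardson extrapolation.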

📝 Abstract

Stochastic gradient methods are central to modern large-scale learning, but their use with incomplete covariates remains delicate since imputation schemes generally introduce systematic gradient biases, as shown for linear models. In this work, we prove that all parametric models exhibit similar gradient bias for various imputation procedures and characterize exactly the dependence on the missingness ratio vector $p$, with $O(\|p\|)$ as the leading term. We exploit this analysis to propose a simple debiasing procedure for stochastic gradient descent (SGD) with missing values based on Richardson extrapolation, which leverages the exact expression of the gradient bias. The key idea is to \emph{deliberately add missingness}: from an already incomplete observation, we generate a further-thinned version at a higher, controlled missingness level, and combine the two resulting stochastic gradients to cancel the leading bias term. We prove that one Richardson step reduces the gradient bias from $O(\|p\|)$ to $O(\|p\|^2)$ under several missingness scenarios. Our proposed method is computationally efficient, model-agnostic and applies to any parametric loss whose stochastic gradient can be computed after imputation. Furthermore, when missing indicators are independent, the population gradient bias is a multilinear polynomial in $p$ and depends only on population gradient errors induced by declaring a single coordinate missing. In this case, our method generalizes to a multi-step Richardson procedure which recursively cancels higher-order terms. Empirically, Richardson debiasing improves optimization and estimation across several generalized linear models and combines positively with widely used imputation procedures such as MICE. These results suggest that, somewhat counter-intuitively, adding controlled missingness on top of existing missing data can make stochastic learning from incomplete data more accurate.
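
The idea of thinning an already incomplete sample and combining the two gradients can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes a linear model with squared loss, zero imputation, a single known MCAR missingness ratio `p` shared by all coordinates, and a second level of `alpha * p`. The helper names `masked_gradient` and `richardson_gradient`, the weight `alpha`, and the simulated data are all hypothetical choices made for this example.

```python
import numpy as np

def masked_gradient(theta, x, y, observed):
    """Mean squared-loss gradient after zero-imputing the missing covariates."""
    x_imp = np.where(observed, x, 0.0)
    residual = x_imp @ theta - y
    return x_imp.T @ residual / len(y)

def richardson_gradient(theta, x, y, observed, p, rng, alpha=2.0):
    """Combine gradients at missingness p and alpha * p (assumed scheme)."""
    # Extra masking probability q solving 1 - (1 - p)(1 - q) = alpha * p.
    q = (alpha - 1.0) * p / (1.0 - p)
    keep = rng.random(x.shape) >= q
    thinned = observed & keep  # deliberately add missingness
    g_p = masked_gradient(theta, x, y, observed)
    g_ap = masked_gradient(theta, x, y, thinned)
    # Weights chosen so that the bias term linear in p cancels.
    return (alpha * g_p - g_ap) / (alpha - 1.0)

rng = np.random.default_rng(0)
n, d, p = 200_000, 3, 0.1

# Correlated covariates: unit variances, pairwise covariance 0.5.
cov = np.full((d, d), 0.5) + 0.5 * np.eye(d)
theta_star = np.array([1.0, -2.0, 0.5])
x = rng.multivariate_normal(np.zeros(d), cov, size=n)
y = x @ theta_star + 0.1 * rng.standard_normal(n)

# MCAR mask: each entry is observed with probability 1 - p.
observed = rng.random((n, d)) >= p

# At theta_star the complete-data population gradient is zero,
# so the norm of each estimate below measures its bias (plus sampling noise).
bias_plain = np.linalg.norm(masked_gradient(theta_star, x, y, observed))
bias_rich = np.linalg.norm(
    richardson_gradient(theta_star, x, y, observed, p, rng)
)
print(f"plain imputed gradient bias: {bias_plain:.4f}")
print(f"Richardson-combined bias:    {bias_rich:.4f}")
```

In this toy setting the imputed-gradient bias at `theta_star` is proportional to `p * (1 - p)`, so the combination `2 * g_p - g_2p` leaves only a term of order `p**2`. Inside an SGD loop, the same combination would be applied to each mini-batch gradient in place of the plain imputed gradient.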

Problem

Research questions and friction points this paper is trying to address.

missing data

gradient bias

stochastic gradient descent

imputation

Richardson extrapolation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Richardson extrapolation

stochastic gradient descent

missing data