🤖 AI Summary
This work addresses robust covariance estimation and outlier detection in online settings, where data arrive sequentially and the proportion of contaminated observations can grow with the size of the dataset, leaving traditional estimators vulnerable to bias and masking effects. The authors propose a method that estimates the geometric median and the median covariance matrix simultaneously in a streaming fashion; to their knowledge, existing covariance estimation and outlier detection methods have not been implemented online. The Mahalanobis distance of each incoming point is computed from these robust estimates, so the point can be assessed as a possible outlier in real time while the masking effect is mitigated. The performance of the method is demonstrated on simulated datasets.
📝 Abstract
Robust estimation of the covariance matrix and detection of outliers remain major challenges in statistical data analysis, particularly when the proportion of contaminated observations increases with the size of the dataset. Outliers can severely bias parameter estimates and induce a masking effect, whereby some outliers conceal the presence of others, further complicating their detection. Although many approaches have been proposed for covariance estimation and outlier detection, to our knowledge, none of them has been implemented in an online setting. In this paper, we focus on online covariance matrix estimation and outlier detection. Specifically, we propose a new method for the simultaneous online estimation of the geometric median and the variance, which allows us to calculate the Mahalanobis distance of each incoming data point before deciding whether it should be considered an outlier. Mitigating the masking effect requires robust estimates of the location and dispersion parameters: our approach uses the geometric median for the location and the median covariance matrix for the dispersion. The proposed online methods for parameter estimation and outlier detection allow real-time identification of outliers as data are observed sequentially. The performance of our methods is demonstrated on simulated datasets.
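The streaming idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes an averaged stochastic-gradient recursion for the geometric median, a common choice for online median estimation, and it leaves out the median covariance matrix step, which the abstract does not specify. The class `OnlineGeometricMedian`, the helper `mahalanobis`, the step-size parameters, and the identity `dispersion` placeholder are all hypothetical names and values chosen for illustration.

```python
import numpy as np


class OnlineGeometricMedian:
    """Averaged stochastic-gradient estimate of the geometric median (sketch)."""

    def __init__(self, dim, step_scale=1.0, step_power=0.66):
        self.n = 0                       # number of points seen so far
        self.iterate = np.zeros(dim)     # raw stochastic-gradient iterate
        self.average = np.zeros(dim)     # running average of the iterates
        self.step_scale = step_scale     # assumed step-size constant
        self.step_power = step_power     # assumed decay exponent in (1/2, 1)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        if self.n == 1:
            # Initialise both sequences at the first observation.
            self.iterate = x.copy()
            self.average = x.copy()
            return self.average
        diff = x - self.iterate
        norm = np.linalg.norm(diff)
        if norm > 0:
            # Move a fixed-length step towards x: only the direction of the
            # new point matters, so a distant outlier has bounded influence.
            step = self.step_scale / self.n ** self.step_power
            self.iterate = self.iterate + step * diff / norm
        # Averaging the iterates smooths out the stochastic-gradient noise.
        self.average = self.average + (self.iterate - self.average) / self.n
        return self.average


def mahalanobis(x, location, dispersion):
    """Mahalanobis distance of x from location under a dispersion matrix."""
    diff = np.asarray(x, dtype=float) - location
    return float(np.sqrt(diff @ np.linalg.solve(dispersion, diff)))


# Synthetic stream: standard normal points, every tenth one shifted far away.
rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 2))
data[9::10] += 20.0

estimator = OnlineGeometricMedian(dim=2)
dispersion = np.eye(2)  # placeholder; a robust dispersion estimate belongs here
scores = []
for x in data:
    if estimator.n > 0:
        # Score the incoming point with the current estimate, then update.
        scores.append(mahalanobis(x, estimator.average, dispersion))
    estimator.update(x)
```

Each point is scored before it enters the estimate, which mirrors the abstract's description of computing the distance for an incoming point before deciding whether it is an outlier. A complete pipeline would replace the identity placeholder with a robust dispersion estimate and compare each score with a threshold suited to that estimate.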