🤖 AI Summary
This paper addresses robustness diagnostics for ordinary least squares (OLS) linear regression—specifically, whether removing a small number of data points can induce substantial changes in statistical conclusions. Checking this directly is a combinatorial optimization problem that is intractable even for simple models, so practitioners rely on approximations. We demonstrate that many mainstream approximation methods can fail to detect genuinely influential subsets under realistic data configurations. Across evaluations on both synthetic and real-world data sets, a simple recursive greedy algorithm is the only method tested that does not fail any of our diagnostics, and it can be orders of magnitude faster to run than some competitors. Our core contribution lies in exposing limitations of existing approximate influence assessments and in identifying the recursive greedy algorithm as an empirically reliable and practical alternative.
📝 Abstract
A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Checking this non-robustness directly poses a combinatorial optimization problem and is intractable even for simple models and moderate data sizes. Recently, various authors have proposed a diverse set of approximations to detect this non-robustness. In the present work, we show that, even in a setting as simple as ordinary least squares (OLS) linear regression, many of these approximations can fail to detect (true) non-robustness in realistic data arrangements. We focus on OLS due to its widespread use and because some approximations work only for OLS. Across our synthetic and real-world data sets, we find that a simple recursive greedy algorithm is the sole algorithm that does not fail any of our tests, and also that it can be orders of magnitude faster to run than some competitors.
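To make the abstract's "recursive greedy algorithm" concrete, here is a minimal sketch of one way such a procedure might look for OLS. This is an illustrative assumption, not the paper's actual implementation: at each step it refits the model with each remaining point left out, drops the single point whose removal most changes the coefficient of interest, and recurses on the reduced data set. The function name `greedy_drop` and the choice of tracking one coefficient are hypothetical.

```python
import numpy as np

def ols_coef(X, y):
    # OLS estimate via least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

def greedy_drop(X, y, k, coord=0):
    """Greedily drop k points. At each step, remove the single point whose
    exact leave-one-out removal most changes coefficient `coord` of the
    OLS fit, then recurse on the remaining data.
    Returns the original indices of the dropped points, in drop order.
    (Illustrative sketch only; not the paper's implementation.)"""
    idx = np.arange(len(y))  # indices into the original data still in play
    dropped = []
    for _ in range(k):
        base = ols_coef(X[idx], y[idx])[coord]
        # exact leave-one-out: refit without each remaining point
        deltas = [abs(ols_coef(np.delete(X[idx], j, axis=0),
                               np.delete(y[idx], j))[coord] - base)
                  for j in range(len(idx))]
        j_star = int(np.argmax(deltas))
        dropped.append(int(idx[j_star]))
        idx = np.delete(idx, j_star)
    return dropped

# Usage: a line y = 2x with one gross outlier; the greedy pass finds it.
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones(50), x])
y = 2.0 * x
y[40] += 10.0  # plant an outlier
print(greedy_drop(X, y, k=1, coord=1))  # the slope-influential point
```

Each greedy step costs one OLS refit per remaining point; closed-form leave-one-out formulas based on the hat matrix can avoid the explicit refits, which is one reason this approach can be fast in practice.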