🤖 AI Summary
This paper addresses the fundamental challenge of reliably verifying the existence of anomalies in unlabeled data—particularly under extremely low contamination rates (ν), where anomalies are rare and difficult to detect.
Method: Through rigorous theoretical analysis and large-scale empirical validation involving more than three million statistical tests across multiple anomaly detection tasks and algorithms, the authors derive and quantify the sample complexity lower bound for anomaly existence testing.
Contribution/Results: They establish the first formal lower bound on the required sample size: \(N \geq \alpha_{\text{algo}} / \nu^2\), where \(\alpha_{\text{algo}}\) is an algorithm-dependent constant. This bound reveals an intrinsic limitation: when sample size falls below this threshold, anomaly existence is fundamentally unverifiable due to insufficient statistical power. It unifies the quantitative interplay among data scale, contamination rate, and algorithmic capability. The result provides the first theoretical feasibility criterion for anomaly detection, shifting the field’s foundational focus from “how to detect” to “whether detection is statistically confirmable.”
📝 Abstract
Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $\alpha_{\text{algo}}$. Our results demonstrate that, for an unlabeled dataset of size $N$ and contamination rate $\nu$, the condition $N \ge \frac{\alpha_{\text{algo}}}{\nu^2}$ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.
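To make the bound concrete, the following minimal Python sketch evaluates it in both directions: the smallest $N$ that satisfies $N \ge \alpha_{\text{algo}} / \nu^2$ for a given contamination rate, and the smallest $\nu$ that satisfies it for a given $N$ (obtained by rearranging to $\nu \ge \sqrt{\alpha_{\text{algo}} / N}$). The function names and the value `alpha_algo = 1.0` are illustrative assumptions; the abstract does not report the constant for any specific algorithm.

```python
import math


def min_samples(alpha_algo: float, nu: float) -> int:
    """Smallest integer N with N >= alpha_algo / nu**2."""
    if alpha_algo <= 0:
        raise ValueError("alpha_algo must be positive")
    if not 0 < nu <= 1:
        raise ValueError("nu must lie in (0, 1]")
    return math.ceil(alpha_algo / nu**2)


def min_contamination(alpha_algo: float, n: int) -> float:
    """Smallest nu with n >= alpha_algo / nu**2, i.e. sqrt(alpha_algo / n)."""
    if alpha_algo <= 0 or n <= 0:
        raise ValueError("alpha_algo and n must be positive")
    return math.sqrt(alpha_algo / n)


def meets_lower_bound(n: int, alpha_algo: float, nu: float) -> bool:
    """True if (n, nu) satisfies the bound; this is necessary, not sufficient."""
    return n * nu**2 >= alpha_algo


if __name__ == "__main__":
    alpha_algo = 1.0  # hypothetical constant, NOT a value taken from the paper
    for nu in (0.1, 0.01, 0.001):
        print(f"nu={nu}: need N >= {min_samples(alpha_algo, nu)}")
    print(f"N=50000: need nu >= {min_contamination(alpha_algo, 50_000):.4f}")
```

Because the required sample size scales with $1/\nu^2$, reducing the contamination rate by a factor of 10 raises the required $N$ by a factor of 100. Since the condition is a lower bound, satisfying it does not by itself guarantee that anomalies can be confirmed.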