🤖 AI Summary
This work addresses the lack of statistical reliability in clustering-based unsupervised anomaly detection, which often leads to excessive false positives. We introduce selective inference, the first such application in clustering-based anomaly detection, to establish a statistically rigorous framework that strictly controls the false positive rate (FPR) at or below a user-specified significance level α. Methodologically, we model the post-clustering anomaly selection process to derive a conditional hypothesis test, enabling theoretically guaranteed FPR control; we further propose strategies to enhance the true positive rate (TPR). Experiments on synthetic and real-world benchmarks demonstrate precise and stable FPR control at the target level (e.g., α = 0.05) while achieving significantly higher TPR than state-of-the-art methods, thus reconciling statistical validity with detection performance.
📝 Abstract
Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this approach lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.
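The problem the paper targets, that testing a point *after* selecting it as the most anomalous inflates the false positive rate, can be illustrated with a minimal toy sketch. This is not SI-CLAD's conditional selective-inference test; as a hypothetical stand-in for a selection-aware procedure, it uses the exact null distribution of the maximum of i.i.d. Gaussian noise, which likewise keeps the FPR at the nominal $\alpha$:

```python
import math
import random

random.seed(0)
alpha, n, trials = 0.05, 50, 2000
naive_fp = adj_fp = 0

for _ in range(trials):
    # Pure N(0,1) noise: any point flagged as an anomaly is a false positive.
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    # Selection step: pick the most extreme observation, mimicking
    # "detect the point farthest from its cluster center".
    z = max(abs(v) for v in x)
    # Two-sided p-value of a single N(0,1) draw: 2 * (1 - Phi(z)).
    p_tail = math.erfc(z / math.sqrt(2))
    # Naive test: ignores that z was selected as the maximum.
    p_naive = p_tail
    # Selection-aware test: exact p-value of max_i |X_i| under the null,
    # P(max >= z) = 1 - (1 - p_tail)^n.
    p_adj = 1.0 - (1.0 - p_tail) ** n
    naive_fp += p_naive < alpha
    adj_fp += p_adj < alpha

print(f"naive FPR: {naive_fp / trials:.3f}, "
      f"selection-adjusted FPR: {adj_fp / trials:.3f}")
```

With $n = 50$ noise points per trial, the naive FPR approaches $1 - 0.95^{50} \approx 0.92$, while the selection-adjusted test stays near the nominal $\alpha = 0.05$, the same kind of guarantee SI-CLAD establishes for the far more involved selection event induced by clustering.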