🤖 AI Summary
This work addresses the lack of statistical reliability in clustering-based unsupervised anomaly detection, which often leads to excessive false positives. We introduce selective inference, the first such application in clustering-based anomaly detection, to establish a statistically rigorous framework that strictly controls the false positive rate (FPR) at or below a user-specified significance level α. Methodologically, we model the post-clustering anomaly selection process to derive a conditional hypothesis test, enabling theoretically guaranteed FPR control; we further propose strategies to enhance the true positive rate (TPR). Experiments on synthetic and real-world benchmarks demonstrate precise and stable FPR control at the target level (e.g., α = 0.05) while achieving significantly higher TPR than state-of-the-art methods, thus reconciling statistical validity with detection performance.
📝 Abstract
Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this approach lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.
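The problem the paper targets, that testing a point *after* selecting it as the most anomalous inflates the false positive rate, can be illustrated with a minimal toy sketch. This is not SI-CLAD's conditional selective-inference test; as a hypothetical stand-in for a selection-aware procedure, it uses the exact null distribution of the maximum of i.i.d. Gaussian noise, which likewise keeps the FPR at the nominal $\alpha$:

```python
import math
import random

random.seed(0)
alpha, n, trials = 0.05, 50, 2000
naive_fp = adj_fp = 0

for _ in range(trials):
    # Pure N(0,1) noise: any point flagged as an anomaly is a false positive.
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    # Selection step: pick the most extreme observation, mimicking
    # "detect the point farthest from its cluster center".
    z = max(abs(v) for v in x)
    # Two-sided p-value of a single N(0,1) draw: 2 * (1 - Phi(z)).
    p_tail = math.erfc(z / math.sqrt(2))
    # Naive test: ignores that z was selected as the maximum.
    p_naive = p_tail
    # Selection-aware test: exact p-value of max_i |X_i| under the null,
    # P(max >= z) = 1 - (1 - p_tail)^n.
    p_adj = 1.0 - (1.0 - p_tail) ** n
    naive_fp += p_naive < alpha
    adj_fp += p_adj < alpha

print(f"naive FPR: {naive_fp / trials:.3f}, "
      f"selection-adjusted FPR: {adj_fp / trials:.3f}")
```

With $n = 50$ noise points per trial, the naive FPR approaches $1 - 0.95^{50} \approx 0.92$, while the selection-adjusted test stays near the nominal $\alpha = 0.05$, the same kind of guarantee SI-CLAD establishes for the far more involved selection event induced by clustering.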