Statistical Inference for Clustering-based Anomaly Detection

📅 2025-04-25
🤖 AI Summary
This work addresses the lack of statistical reliability in clustering-based unsupervised anomaly detection, which often leads to excessive false positives. We apply selective inference, for the first time in clustering-based anomaly detection, to establish a statistically rigorous framework that keeps the false positive rate (FPR) at or below a user-specified significance level α. Methodologically, we model the post-clustering anomaly selection process to derive a conditional hypothesis test with a theoretical guarantee of FPR control; we further propose strategies to raise the true positive rate (TPR). Experiments on synthetic and real-world benchmarks show stable FPR control at the target level (e.g., α = 0.05) together with significantly higher TPR than state-of-the-art methods, reconciling statistical validity with detection performance.
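
The FPR guarantee described above can be written as a conditional statement. The following is a sketch in generic selective-inference notation; the symbols $X$, $x^{\mathrm{obs}}$, $\mathcal{A}$, and $p^{\mathrm{selective}}$ are assumptions introduced here for illustration and are not taken from the paper.

```latex
% Assumed notation (not from the source):
%   X            random data vector, x^obs its observed value
%   A(.)         the clustering-based procedure that returns the flagged points
%   p^selective  p-value computed conditionally on the selection event
\[
  \mathbb{P}_{H_0}\!\left( p^{\mathrm{selective}} \le \alpha
    \;\middle|\; \mathcal{A}(X) = \mathcal{A}(x^{\mathrm{obs}}) \right) \le \alpha
  \quad \text{for all } \alpha \in [0, 1].
\]
```

Under this reading, rejecting the null hypothesis whenever $p^{\mathrm{selective}} \le \alpha$ keeps the probability of a false detection at or below α, given the set of points that the clustering step flagged.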

📝 Abstract
Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.
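
To illustrate why testing data-selected anomalies needs a correction, here is a minimal sketch of the selection effect. It is a toy stand-in, not SI-CLAD: instead of running a clustering algorithm, it flags the single most extreme point among i.i.d. standard normal observations, then compares a naive two-sided z-test with a p-value that conditions on the selection event. The function names, the selection rule, and all parameter values are hypothetical.

```python
import math
import random


def two_sided_tail(a):
    """Return P(|Z| > a) for a standard normal Z."""
    return math.erfc(a / math.sqrt(2.0))


def simulate_fpr(n_points=20, n_trials=2000, alpha=0.05, seed=0):
    """Estimate false positive rates of a naive and a selective test.

    Assumption: the 'detector' flags the point with the largest |x|.
    This replaces the clustering step purely for illustration.
    """
    rng = random.Random(seed)
    naive_rejections = 0
    selective_rejections = 0
    for _ in range(n_trials):
        # Null data: every point is N(0, 1), so there are no true anomalies.
        abs_x = sorted(abs(rng.gauss(0.0, 1.0)) for _ in range(n_points))
        largest, runner_up = abs_x[-1], abs_x[-2]

        # Naive p-value: pretends the tested point was fixed in advance.
        p_naive = two_sided_tail(largest)

        # Selective p-value: conditions on the flagged point having beaten
        # every other point, which gives a truncated normal null distribution.
        p_selective = two_sided_tail(largest) / two_sided_tail(runner_up)

        if p_naive <= alpha:
            naive_rejections += 1
        if p_selective <= alpha:
            selective_rejections += 1
    return naive_rejections / n_trials, selective_rejections / n_trials


naive_fpr, selective_fpr = simulate_fpr()
print(f"naive FPR: {naive_fpr:.3f}, selective FPR: {selective_fpr:.3f}")
```

In this toy setting the naive test rejects with probability 1 − 0.95^20 ≈ 0.64 even though no point is anomalous, while the conditional p-value is uniform under the null, so its rejection rate should sit near α. The selection event in clustering-based AD is more involved than this one, but the conditioning idea is the same.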
Problem

Research questions and friction points this paper is trying to address.

Ensures reliability of clustering-based anomaly detection results
Controls false anomaly detection probability rigorously
Boosts true detection rate for improved performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Statistical framework for clustering-based anomaly detection
Controls false anomaly detection probability rigorously
Boosts true detection rate with Selective Inference
Nguyen Thi Minh Phu
University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam.
Duong Tan Loc
University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam.
Vo Nguyen Le Duy
Lecturer at University of Information Technology / Visiting Scientist at RIKEN
Machine Learning, Data Science, Statistics