AI Summary
Data stream anomaly detection demands both high accuracy and real-time processing under continuously evolving data distributions, a challenge that existing methods do not meet. This paper proposes a dynamic incremental detection framework based on kernel mean embedding (KME), the first to jointly integrate data-dependent kernels with inheritance-based isolation distribution modeling. It introduces a lightweight incremental update mechanism that is theoretically guaranteed to be statistically equivalent to full retraining. Crucially, the method requires no historical data storage and processes streams in a single pass. Extensive experiments across 13 standard benchmark datasets demonstrate that the approach achieves significantly higher average detection accuracy than state-of-the-art methods, while accelerating inference by approximately 9×. It is also robust to concept drift and incurs low computational overhead, striking an unprecedented balance between accuracy, efficiency, and adaptability in streaming settings.
Abstract
Anomaly detection on data streams presents significant challenges, requiring methods to maintain high detection accuracy under evolving distributions while ensuring real-time efficiency. Here we introduce $\mathcal{IDK}$-$\mathcal{S}$, a novel $\mathbf{I}$ncremental $\mathbf{D}$istributional $\mathbf{K}$ernel for $\mathbf{S}$treaming anomaly detection that effectively addresses these challenges by creating a new dynamic representation in the kernel mean embedding framework. The superiority of $\mathcal{IDK}$-$\mathcal{S}$ is attributed to two key innovations. First, it inherits the strengths of the Isolation Distributional Kernel, an offline detector that has demonstrated significant performance advantages over foundational methods like Isolation Forest and Local Outlier Factor due to the use of a data-dependent kernel. Second, it adopts a lightweight incremental update mechanism that significantly reduces computational overhead compared to the naive baseline strategy of performing a full model retraining. This is achieved without compromising detection accuracy, a claim supported by its statistical equivalence to the fully retrained model. Our extensive experiments on thirteen benchmarks demonstrate that $\mathcal{IDK}$-$\mathcal{S}$ achieves superior detection accuracy while operating substantially faster, in many cases by an order of magnitude, than existing state-of-the-art methods.
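To make the idea of an incremental kernel mean embedding concrete, the following is a minimal sketch and not the $\mathcal{IDK}$-$\mathcal{S}$ algorithm itself. It assumes a fixed random Fourier feature map for a Gaussian kernel as a stand-in for the data-dependent Isolation Kernel, whose own update rule is not described in the abstract. The class name `RunningMeanEmbedding` and the parameters `n_features` and `gamma` are hypothetical. Under a fixed feature map the embedding is simply the mean of the feature vectors, so a running-mean update reproduces the batch result without storing past points.

```python
import numpy as np


class RunningMeanEmbedding:
    """Running kernel mean embedding under a FIXED random Fourier feature map.

    Assumption: a Gaussian-kernel feature map stands in for the data-dependent
    Isolation Kernel. This is an illustration, not the IDK-S update rule.
    """

    def __init__(self, dim, n_features=256, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random frequencies approximating k(x, y) = exp(-gamma * ||x - y||^2).
        self.W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, n_features))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.mean = np.zeros(n_features)  # current mean embedding
        self.count = 0                    # number of points folded in so far

    def features(self, x):
        """Map one point (dim,) or a batch (n, dim) to feature space."""
        return self.scale * np.cos(np.asarray(x, dtype=float) @ self.W + self.b)

    def score(self, x):
        """Similarity of x to the stream seen so far; lower suggests an anomaly."""
        return float(self.features(x) @ self.mean)

    def update(self, x):
        """Fold one point into the embedding in O(n_features), keeping no history."""
        self.count += 1
        self.mean += (self.features(x) - self.mean) / self.count


# Single pass over a synthetic stream: score each point, then absorb it.
rng = np.random.default_rng(1)
stream = rng.normal(size=(500, 2))
model = RunningMeanEmbedding(dim=2)
scores = []
for x in stream:
    scores.append(model.score(x))
    model.update(x)

# The running mean matches the mean recomputed from the whole stream.
batch_mean = model.features(stream).mean(axis=0)
print(np.allclose(model.mean, batch_mean))
```

The equality checked at the end holds only because the feature map never changes. With a data-dependent kernel the feature map itself shifts as the data evolve, which is the harder case the paper's update mechanism is designed to handle.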