🤖 AI Summary
To make change detection in data streams efficient, with applications such as predictive maintenance, fraud detection, and medical monitoring, this paper proposes Maximum Mean Discrepancy on Exponential Windows (MMDEW), a change detection algorithm built on the kernel-based Maximum Mean Discrepancy (MMD) two-sample statistic. The method computes MMD over exponential windows, which avoids the quadratic runtime of classical MMD estimators: the authors prove polylogarithmic runtime and logarithmic memory complexity. This removes the main computational obstacle to using a nonparametric distributional discrepancy measure in streaming settings. On benchmark data streams, MMDEW empirically outperforms state-of-the-art change detectors, offering a scalable and theoretically grounded option for change detection under resource constraints.
📝 Abstract
Detecting changes is of fundamental importance when analyzing data streams and has many applications, e.g., in predictive maintenance, fraud detection, or medicine. A principled approach to detect changes is to compare the distributions of observations within the stream to each other via hypothesis testing. Maximum mean discrepancy (MMD), a (semi-)metric on the space of probability distributions, provides powerful non-parametric two-sample tests on kernel-enriched domains. In particular, MMD is able to detect any disparity between distributions under mild conditions. However, classical MMD estimators suffer from a quadratic runtime complexity, which renders their direct use for change detection in data streams impractical. In this article, we propose a new change detection algorithm, called Maximum Mean Discrepancy on Exponential Windows (MMDEW), that combines the benefits of MMD with an efficient computation based on exponential windows. We prove that MMDEW enjoys polylogarithmic runtime and logarithmic memory complexity and show empirically that it outperforms the state of the art on benchmark data streams.
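To make the quadratic cost mentioned in the abstract concrete, here is a minimal sketch of the classical biased estimator of squared MMD. It is not the MMDEW algorithm, only the baseline that MMDEW is designed to speed up. The Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b.

    The kernel choice and the bandwidth are illustrative assumptions.
    """
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y.

    Every pair of observations is evaluated, so the cost grows
    quadratically with the sample size.
    """
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


rng = np.random.default_rng(0)
# Two samples from the same distribution: the estimate should be small.
same = mmd2_biased(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
# Second sample has a shifted mean: the estimate should be clearly larger.
shifted = mmd2_biased(rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2)))
print(f"same distribution: {same:.4f}, shifted mean: {shifted:.4f}")
```

Recomputing such an estimate from scratch at every candidate change point of a growing stream is what makes the direct approach impractical, which is the motivation the abstract gives for exponential windows.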