🤖 AI Summary
To address the challenge of detecting weak signals and rare anomalies in high-dimensional data, this paper proposes a sparse, self-organizing local kernel framework for anomaly detection. The method adaptively partitions statistically imbalanced regions of the representation space, enabling efficient anomaly localization under minimal supervision, e.g., fully unsupervised or sparsely labeled settings. By combining sparsity, locality, and competitive learning, it constructs an interpretable and scalable self-organizing kernel model. Its technical components include Gaussian kernel ensembles, semi-supervised Neyman–Pearson learning, and local likelihood-ratio modeling, which together characterize the statistical discrepancies between anomalous and normal samples. Experiments show that the framework identifies statistically significant anomalous regions in representation spaces with thousands of dimensions using only a handful of kernels, and that it performs strongly on scientific discovery, novelty detection, intrusion detection, and generative-model validation tasks.
📝 Abstract
Modern artificial intelligence has revolutionized our ability to extract rich and versatile data representations across scientific disciplines. Yet, the statistical properties of these representations remain poorly controlled, causing misspecified anomaly detection (AD) methods to falter. Weak or rare signals can remain hidden within the apparent regularity of normal data, creating a gap in our ability to detect and interpret anomalies. We examine this gap and identify a set of structural desiderata for detection methods operating under minimal prior information: sparsity, to enforce parsimony; locality, to preserve geometric sensitivity; and competition, to promote efficient allocation of model capacity. These principles define a class of self-organizing local kernels that adaptively partition the representation space around regions of statistical imbalance. As an instantiation of these principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained within a semi-supervised Neyman–Pearson framework to locally model the likelihood ratio between a sample that may contain anomalies and a nominal, anomaly-free reference. We provide theoretical insights into the mechanisms that drive detection and self-organization in the proposed model, and demonstrate the effectiveness of this approach on realistic high-dimensional problems of scientific discovery, open-world novelty detection, intrusion detection, and generative-model validation. Our applications span both the natural- and computer-science domains. We demonstrate that ensembles containing only a handful of kernels can identify statistically significant anomalous locations within representation spaces of thousands of dimensions, underscoring the interpretability, efficiency, and scalability of the proposed approach.
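To make the core idea concrete, the following is a minimal toy sketch (not the authors' implementation) of likelihood-ratio estimation with a small Gaussian-kernel ensemble and a Neyman–Pearson-style training objective: the model f(x) approximates the log-likelihood ratio between a data sample containing a rare bump and an anomaly-free reference, trained by gradient descent on the extended-likelihood loss Σ_ref(e^f − 1) − Σ_data f. The dataset, kernel count, bandwidth, learning rate, and fixed kernel centers are all illustrative assumptions; in the paper's framework the kernels are sparse and self-organizing rather than fixed.

```python
# Toy sketch (assumed setup, not SparKer itself): fit a log-likelihood-ratio
# model f(x) = sum_k w_k * exp(-(x - c_k)^2 / (2 sigma^2)) between a data
# sample with a rare anomalous bump and a nominal reference sample.
import numpy as np

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(2000, 1))                 # nominal reference
data = np.vstack([rng.normal(0.0, 1.0, size=(1900, 1)),    # normal bulk
                  rng.normal(3.0, 0.3, size=(100, 1))])    # rare anomaly

K, sigma, lr = 8, 0.8, 0.05                                # assumed hyperparameters
centers = rng.choice(data.ravel(), K).reshape(K, 1)        # fixed centers (toy)
w = np.zeros(K)                                            # kernel weights

def phi(x):
    """Gaussian kernel feature matrix, shape (n_points, K)."""
    return np.exp(-((x - centers.T) ** 2) / (2 * sigma ** 2))

for _ in range(300):
    # Neyman-Pearson extended-likelihood loss:
    #   L(w) = sum_ref (e^{f} - 1) - sum_data f,  with f = phi @ w.
    f_ref = phi(ref) @ w
    grad = phi(ref).T @ np.exp(f_ref) - phi(data).sum(axis=0)
    w -= lr * grad / len(ref)                              # normalized step

# Test statistic t = -2 * L(w); it is zero at w = 0 and grows as the model
# finds a region of statistical imbalance (the bump near x = 3).
t = 2 * ((phi(data) @ w).sum() - (np.exp(phi(ref) @ w) - 1).sum())
print(t > 0)  # the anomalous bump yields a positive statistic
```

At the start of training (w = 0) the loss is exactly zero, so any decrease achieved by gradient descent translates directly into a positive test statistic; weights attached to kernels sitting in balanced regions stay near zero, which is a simplified view of the sparsity the paper enforces.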