Structure-based Anomaly Detection and Clustering

📅 2025-05-19
🤖 AI Summary
This thesis addresses three problems: unsupervised anomaly detection in structured and streaming data, structure-based clustering of noisy data, and open-set identification of malware families. For anomaly detection, it introduces Preference Isolation Forest (PIF), which embeds data into a preference space via manifold fitting and isolates outliers with either Voronoi-iForest (geometric distances) or RuzHash-iForest (locality-sensitive hashing, LSH), along with a streaming variant, Sliding-PIF. For noisy data containing multiple geometric structures, it proposes MultiLink, a clustering method that merges clusters with a model-aware linkage strategy. For evolving data streams, it presents Online-iForest, which avoids retraining by dynamically updating tree structures built on adaptive, multi-resolution histograms. For open-set malware recognition, it augments a Gradient Boosting classifier with MaxLogit, a method now integrated into Cleafy's production system. The abstract reports that the PIF methods outperform existing techniques on synthetic and real datasets, that MultiLink is faster, less sensitive to thresholds, and more robust to poor initial sampling than existing approaches, and that Online-iForest reaches accuracy comparable to offline models with better efficiency.

📝 Abstract
Anomaly detection is a fundamental problem in domains such as healthcare, manufacturing, and cybersecurity. This thesis proposes new unsupervised methods for anomaly detection in both structured and streaming data settings. In the first part, we focus on structure-based anomaly detection, where normal data follows low-dimensional manifolds while anomalies deviate from them. We introduce Preference Isolation Forest (PIF), which embeds data into a high-dimensional preference space via manifold fitting, and isolates outliers using two variants: Voronoi-iForest, based on geometric distances, and RuzHash-iForest, leveraging Locality Sensitive Hashing for scalability. We also propose Sliding-PIF, which captures local manifold information for streaming scenarios. Our methods outperform existing techniques on synthetic and real datasets. We extend this to structure-based clustering with MultiLink, a novel method for recovering multiple geometric model families in noisy data. MultiLink merges clusters via a model-aware linkage strategy, enabling robust multi-class structure recovery. It offers key advantages over existing approaches, such as speed, reduced sensitivity to thresholds, and improved robustness to poor initial sampling. The second part of the thesis addresses online anomaly detection in evolving data streams. We propose Online Isolation Forest (Online-iForest), which uses adaptive, multi-resolution histograms and dynamically updates tree structures to track changes over time. It avoids retraining while achieving accuracy comparable to offline models, with superior efficiency for real-time applications. Finally, we tackle anomaly detection in cybersecurity via open-set recognition for malware classification. We enhance a Gradient Boosting classifier with MaxLogit to detect unseen malware families, a method now integrated into Cleafy's production system.
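The MaxLogit step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes the classifier exposes one raw score (logit) per known family and that a sample is rejected as unknown when its largest logit falls below a threshold. The family names, logit values, and threshold below are hypothetical, and this is not the thesis's or Cleafy's implementation.

```python
from typing import Sequence

UNKNOWN = "unknown"


def max_logit_predict(logits: Sequence[float], labels: Sequence[str], threshold: float) -> str:
    """Return the label with the largest logit, or UNKNOWN if that logit is below threshold."""
    if len(logits) != len(labels):
        raise ValueError("one logit per known label is required")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best] if logits[best] >= threshold else UNKNOWN


# Hypothetical family names and raw classifier scores, for illustration only.
families = ["family_a", "family_b", "family_c"]
confident = [0.3, 4.1, -1.2]    # one logit clearly dominates
ambiguous = [-0.8, -0.5, -1.1]  # no known family scores highly
threshold = 1.0                 # assumed value; in practice tuned on held-out data

print(max_logit_predict(confident, families, threshold))
print(max_logit_predict(ambiguous, families, threshold))
```

Thresholding the raw maximum logit rather than a normalized probability keeps the absolute strength of the evidence: a sample that scores low for every known family stays low instead of being rescaled to look confident.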
Problem

Research questions and friction points this paper is trying to address.

Unsupervised anomaly detection in structured and streaming data
Structure-based clustering that recovers multiple geometric models from noisy data
Efficient online anomaly detection in evolving data streams
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference Isolation Forest for manifold-based anomaly detection
MultiLink clustering for multiple geometric model recovery
Online Isolation Forest for evolving data streams
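The preference embedding behind the first item above can be sketched as follows, under stated assumptions: the models are 2D lines fitted to random pairs of points, and a point's preference for a model decays exponentially with its residual and is cut to zero beyond 3 * sigma. The function names, the preference formula, and all parameter values are illustrative choices, not taken from the thesis, and the isolation step that PIF applies in the preference space is omitted.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float]  # (a, b, c): a*x + b*y + c = 0 with a^2 + b^2 = 1


def fit_line(p: Point, q: Point) -> Line:
    """Line through two distinct points, with a unit-length normal (a, b)."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return (a, b, -(a * p[0] + b * p[1]))


def residual(pt: Point, line: Line) -> float:
    """Perpendicular distance from a point to a line."""
    a, b, c = line
    return abs(a * pt[0] + b * pt[1] + c)


def preference(r: float, sigma: float) -> float:
    """Assumed preference function: exponential decay, zero beyond 3 * sigma."""
    return math.exp(-r / sigma) if r <= 3.0 * sigma else 0.0


def preference_embedding(points: List[Point], n_models: int, sigma: float,
                         rng: random.Random) -> List[List[float]]:
    """Map each point to its vector of preferences for randomly fitted lines."""
    models: List[Line] = []
    while len(models) < n_models:
        p, q = rng.sample(points, 2)
        if p != q:  # skip degenerate pairs of coincident points
            models.append(fit_line(p, q))
    return [[preference(residual(pt, m), sigma) for m in models] for pt in points]


# Twenty points on the line y = 2x + 1, plus one point far from it (listed last).
points = [(float(x), 2.0 * x + 1.0) for x in range(20)] + [(5.0, 40.0)]
embedding = preference_embedding(points, n_models=50, sigma=0.5, rng=random.Random(0))
totals = [sum(row) for row in embedding]
print("lowest inlier total:", min(totals[:-1]))
print("outlier total:", totals[-1])
```

Points lying on the shared structure agree with most of the fitted models, so their preference vectors look alike. The off-structure point agrees only with the few models sampled through it, which is what makes it easy to isolate in the preference space.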