🤖 AI Summary
Subspace clustering in high-dimensional data often yields multiple semantically distinct subspaces, yet existing methods require manual specification of both the number of subspaces and the number of clusters within each, rendering them parameter-sensitive and poorly interpretable. This paper proposes an automatic, non-redundant multi-subspace clustering framework. First, it introduces the Minimum Description Length (MDL) principle to non-redundant clustering, enabling joint, adaptive inference of both the optimal number of subspaces and the cluster count per subspace. Second, it designs a split-merge-based greedy search strategy coupled with a subspace-level outlier encoding mechanism, allowing simultaneous outlier detection. Evaluated on multiple benchmark datasets, the method achieves accuracy competitive with state-of-the-art approaches while significantly improving parameter robustness, model interpretability, and practical applicability.
📝 Abstract
High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered by color, weight, or size, revealing different interpretations of the given dataset. A variety of approaches are able to identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and the number of clusters for each subspace. Stating these values is a non-trivial problem and usually requires detailed knowledge of the input dataset. In this paper, we propose a framework that utilizes the Minimum Description Length (MDL) principle to detect the number of subspaces and clusters per subspace automatically. We describe an efficient procedure that greedily searches the parameter space by splitting and merging subspaces and clusters within subspaces. Additionally, an encoding strategy is introduced that allows us to detect outliers in each subspace. Extensive experiments show that our approach is highly competitive with state-of-the-art methods.
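The core idea, trading model bits against data bits and greedily accepting splits only while the total description length drops, can be illustrated with a toy one-subspace version. Everything below is an illustrative assumption: the 8-bit center cost, the residual code, and all function names are made up for this sketch, not the paper's actual encoding, and the merge move, outlier encoding, and search over subspaces are omitted for brevity.

```python
import math

def mdl_cost(clusters):
    """Toy MDL score: bits to state each cluster center (model cost) plus
    bits to encode each point's offset from its center (data cost).
    The 8-bit-per-center cost and the log-residual code are illustrative
    choices, not the paper's encoding."""
    bits = 8.0 * len(clusters)  # fixed parameter cost per cluster center
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        bits += sum(math.log2(2.0 + abs(x - center)) for x in cluster)
    return bits

def split(cluster):
    """Split a cluster at its mean; drop empty halves."""
    center = sum(cluster) / len(cluster)
    halves = ([x for x in cluster if x <= center],
              [x for x in cluster if x > center])
    return [h for h in halves if h]

def greedy_split(points):
    """Greedily split clusters as long as the total MDL cost decreases.
    Stops automatically: no number of clusters is specified by the user."""
    clusters = [sorted(points)]
    cost = mdl_cost(clusters)
    improved = True
    while improved:
        improved = False
        for i, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue
            candidate = clusters[:i] + split(cluster) + clusters[i + 1:]
            if mdl_cost(candidate) < cost:
                clusters, cost = candidate, mdl_cost(candidate)
                improved = True
                break
    return clusters, cost

clusters, cost = greedy_split([0.9, 1.0, 1.1, 9.8, 10.0, 10.2])
# the two well-separated groups are recovered without specifying k:
# splitting further would add more model bits than it saves in data bits
```

The same accept-if-cheaper loop extends naturally to merge moves and to splitting or merging whole subspaces, which is the shape of the search the paper describes.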