CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Scientific Discovery

📅 2026-01-14
🤖 AI Summary
This work addresses a key challenge in data-driven scientific discovery: existing semi-supervised clustering methods often struggle to recognize known classes and detect unknown anomalies at the same time, because they rely on global constraints or a predefined number of clusters. To overcome this limitation, the authors propose CLiMB, a framework that decouples the use of prior knowledge from the exploration of new structures. In its first stage, CLiMB anchors known clusters using constrained clustering. In its second stage, it applies density-based clustering to the residual data, which needs no preset number of clusters and can uncover arbitrarily shaped, previously unknown structures. Evaluated on Gaia DR3 RR Lyrae star data, CLiMB achieves an adjusted Rand index of 0.829 with 90% seed coverage when recovering known Milky Way substructures, while heuristic and constraint-based baselines stay below 0.20. It also isolates three dynamical structures in the unlabelled field: Shiva, Shakti, and the Galactic disk.
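For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the adjusted Rand index computed from the standard pair-counting formula. It is an illustration only, not the authors' evaluation code; the function name and the toy labelings are assumptions made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Chance-corrected agreement between two labelings (pair-counting form)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: how many items fall in each (true, predicted) cell.
    cell_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Number of item pairs grouped together in both / in each labeling.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / comb(n, 2)
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (max_index - expected)


# Toy labelings (hypothetical): label names do not matter, only the grouping.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)
```

A value of 1 means the two groupings agree exactly, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.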

📝 Abstract
In data-driven scientific discovery, a central challenge lies in classifying well-characterized phenomena while also identifying novel anomalies. Current semi-supervised clustering algorithms do not fully address this duality, often assuming that supervisory signals are globally representative. Consequently, they tend either to enforce rigid constraints that suppress unanticipated patterns or to require a pre-specified number of clusters, which renders them ineffective for genuine novelty detection. To bridge this gap, we introduce CLiMB (CLustering in Multiphase Boundaries), a domain-informed framework that decouples the exploitation of prior knowledge from the exploration of unknown structures. Using a sequential two-phase approach, CLiMB first anchors known clusters using constrained partitioning, and subsequently applies density-based clustering to the residual data to reveal arbitrary topologies. We demonstrate this framework on RR Lyrae star data from Gaia Data Release 3. CLiMB attains an Adjusted Rand Index of 0.829 with 90% seed coverage in recovering known Milky Way substructures, drastically outperforming heuristic and constraint-based baselines, which stagnate below 0.20. Furthermore, sensitivity analysis confirms CLiMB's superior data efficiency, showing monotonic improvement as the amount of prior knowledge increases. Finally, the framework successfully isolates three dynamical features (Shiva, Shakti, and the Galactic Disk) in the unlabelled field, validating its potential for scientific discovery.
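The two-phase idea described in the abstract can be sketched in a few lines of plain Python. This is a hedged illustration, not the CLiMB implementation: phase 1 is approximated here by a seeded nearest-centroid assignment with a distance cutoff (a stand-in for the paper's constrained partitioning), phase 2 by a textbook DBSCAN on the points left unassigned. The function names, the parameters `radius`, `eps`, and `min_pts`, and the synthetic 2D data are all assumptions made for the example.

```python
import math
from collections import Counter


def mean(pts):
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def anchor_known(points, seeds, radius, n_iter=10):
    """Phase 1 (stand-in): grow known clusters from labelled seed points.
    Points farther than `radius` from every centroid stay unassigned (None)."""
    centroids = {lab: mean(pts) for lab, pts in seeds.items()}
    labels = [None] * len(points)
    for _ in range(n_iter):
        for i, p in enumerate(points):
            lab, d = min(((l, math.dist(p, c)) for l, c in centroids.items()),
                         key=lambda t: t[1])
            labels[i] = lab if d <= radius else None
        for lab in centroids:
            # Seeds always count as members, so a cluster is never empty.
            members = [p for p, l in zip(points, labels) if l == lab] + seeds[lab]
            centroids[lab] = mean(members)
    return labels


def dbscan(points, eps, min_pts):
    """Phase 2 (textbook DBSCAN): cluster ids 0, 1, ... and -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb_j)
        cluster += 1
    return labels


def two_phase(points, seeds, radius, eps, min_pts):
    known = anchor_known(points, seeds, radius)
    residual_idx = [i for i, lab in enumerate(known) if lab is None]
    novel = dbscan([points[i] for i in residual_idx], eps, min_pts)
    out = list(known)
    for i, lab in zip(residual_idx, novel):
        out[i] = "noise" if lab == -1 else f"new_{lab}"
    return out


def grid(cx, cy, step=0.2, half=2):
    """Small square patch of points centred on (cx, cy); synthetic toy data."""
    return [(cx + i * step, cy + j * step)
            for i in range(-half, half + 1) for j in range(-half, half + 1)]


# Two seeded groups, one unseeded group, and one isolated point.
points = grid(0, 0) + grid(10, 0) + grid(5, 10) + [(20.0, 20.0)]
seeds = {"A": [(0.0, 0.0)], "B": [(10.0, 0.0)]}
labels = two_phase(points, seeds, radius=2.0, eps=0.5, min_pts=4)
print(Counter(labels))
```

In this toy setup the two seeded groups are absorbed in phase 1, while the unseeded group is only found in phase 2, without telling the algorithm how many new clusters to expect. The real framework works on features of Gaia DR3 RR Lyrae stars, which this sketch does not attempt to reproduce.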
Problem

Research questions and friction points this paper is trying to address.

novelty detection
semi-supervised clustering
scientific discovery
anomaly identification
domain-informed clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

novelty detection
semi-supervised clustering
density-based clustering
domain-informed learning
scientific discovery
Lorenzo Monti
INAF - Istituto Nazionale di Astrofisica
Deep Learning, Astronomy, RR Lyrae, EdgeML
T. Muraveva
National Institute for Astrophysics, Osservatorio di Astrofisica e Scienza dello Spazio, Bologna, via Piero Gobetti 93/3, 40129, Italy
Brian Sheridan
TOELT LLC, Machine Learning Research and Development Department, Winterthur 8406, Zurich, Switzerland
Davide Massari
National Institute for Astrophysics, Osservatorio di Astrofisica e Scienza dello Spazio, Bologna, via Piero Gobetti 93/3, 40129, Italy
A. Garofalo
National Institute for Astrophysics, Osservatorio di Astrofisica e Scienza dello Spazio, Bologna, via Piero Gobetti 93/3, 40129, Italy
G. Clementini
National Institute for Astrophysics, Osservatorio di Astrofisica e Scienza dello Spazio, Bologna, via Piero Gobetti 93/3, 40129, Italy
Umberto Michelucci
TOELT LLC, Machine Learning Research and Development Department, Winterthur 8406, Zurich, Switzerland; Computer Science Department, Lucerne University of Applied Sciences and Arts, Luzern 6002, Switzerland