🤖 AI Summary
In the physical sciences, time-series annotation is hampered by a scarcity of expert annotators, high annotation costs, and poor inter-annotator consistency, all of which limit the machine learning models that depend on labeled data. To address this, we propose CIPHER, a scalable and interpretable framework that combines iSAX-based symbolic indexing, HDBSCAN density-based clustering, and human-in-the-loop verification. Its core idea is to couple symbolic representation with density-based clustering, then have domain experts label representative samples so that those labels can be propagated across each cluster. Evaluated on OMNI solar wind data, CIPHER recovers meaningful space weather phenomena, including coronal mass ejections (CMEs) and stream interaction regions (SIRs). The approach is presented as a general strategy for systematic, knowledge-informed time-series annotation across the physical sciences.
📝 Abstract
Labeling or classifying time series is a persistent challenge in the physical sciences, where expert annotations are scarce, costly, and often inconsistent. Yet robust labeling is essential to enable machine learning models for understanding, prediction, and forecasting. We present the Clustering and Indexation Pipeline with Human Evaluation for Recognition (CIPHER), a framework designed to accelerate large-scale labeling of complex time series in physics. CIPHER integrates indexable Symbolic Aggregate approXimation (iSAX) for interpretable compression and indexing, density-based clustering (HDBSCAN) to group recurring phenomena, and a human-in-the-loop step for efficient expert validation. Representative samples are labeled by domain scientists, and these annotations are propagated across clusters to yield systematic, scalable classifications. We evaluate CIPHER on the task of classifying solar wind phenomena in OMNI data, a central challenge in space weather research, showing that the framework recovers meaningful phenomena such as coronal mass ejections and stream interaction regions. Beyond this case study, CIPHER highlights a general strategy for combining symbolic representations, unsupervised learning, and expert knowledge to address label scarcity in time series across the physical sciences. The code and configuration files used in this study are publicly available to support reproducibility.