🤖 AI Summary
Accurately characterizing the intrinsic dimensionality (iD) of astronomical datasets—particularly Radio Galaxy Zoo (RGZ)—remains challenging, yet iD is critical for assessing representation quality, self-supervised learning efficacy, and anomaly detection. Method: We estimate iD using a score-based diffusion model and systematically analyze its relationships with Bayesian neural network (BNN) energy scores, Fanaroff-Riley (FR) morphological classes (FR I vs. FR II), and signal-to-noise ratio (SNR). Contribution/Results: We report the first empirical evidence that out-of-distribution radio sources exhibit significantly higher iD than in-distribution ones, and that RGZ’s overall iD exceeds that of natural image datasets. iD shows a strong negative correlation with BNN energy scores and a weak negative correlation with SNR; however, no statistically significant difference in iD is observed between FR I and FR II sources. This work establishes a novel paradigm for evaluating representation quality in astrophysical data and extends the theoretical interpretability and practical applicability of iD in self-supervised learning and anomaly detection.
📝 Abstract
In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds those typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no significant difference in iD is found between the FR I and FR II classes, we observe a weak trend toward higher SNR at lower iD. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.
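The paper estimates iD with a score-based diffusion model, which is involved to reproduce. As a lightweight illustration of what "intrinsic dimension" measures, the sketch below uses the classical TwoNN estimator (ratios of second- to first-nearest-neighbour distances) on synthetic data whose true iD is known; the dataset, dimensions, and estimator here are illustrative assumptions, not the paper's method.

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate (toy illustration, not the
    paper's diffusion-based estimator).

    For each point, compute mu = r2 / r1, the ratio of distances to its
    second and first nearest neighbours; the MLE of the intrinsic
    dimension is then N / sum(log mu).
    """
    # Full pairwise distance matrix (fine for small N).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # exclude self-distances
    D.sort(axis=1)
    r1, r2 = D[:, 0], D[:, 1]
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
# Synthetic data: a 2-D linear manifold embedded in 10-D ambient space,
# so the true intrinsic dimension is 2 even though points live in R^10.
latent = rng.normal(size=(500, 2))
embedding = rng.normal(size=(2, 10))
X = latent @ embedding
print(f"estimated iD: {twonn_id(X):.2f}")  # close to the true value of 2
```

Applied to image cutouts flattened into vectors, an estimator like this returns a number far below the pixel count, which is the sense in which the paper compares iD across RGZ, MiraBest, and natural image datasets.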