🤖 AI Summary
This paper addresses the challenge of uncovering latent semantic correlations and estimating intrinsic dimensionality in high-dimensional binary data. To this end, it proposes Binary Intrinsic Dimension (BID) as a proxy measure for semantic complexity. Methodologically, it introduces a robust, unsupervised correlation detection framework resilient to the curse of dimensionality, integrating concepts from statistical physics, neuroscience, and deep learning—including BID estimation, sparse manifold analysis, phase-transition detection, and neural representation decomposition. Notably, this work pioneers the systematic incorporation of BID into semantic modeling, empirically validating its efficacy in identifying magnetic phase transitions and localizing semantic structures within CNNs and Transformers. Experiments demonstrate a strong empirical correlation between BID and semantic relatedness, establishing BID as an interpretable, scalable paradigm for semantic structure discovery across images, text, and physical systems.
📝 Abstract
In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test this approach identifying phase transitions in model magnetic systems and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.