🤖 AI Summary
Traditional typological inventories suffer from feature sparsity and static rigidity, failing to capture the continuous, dynamic nature of cross-linguistic typological relationships. To address this, we propose a novel paradigm for language representation grounded in the prediction entropy of monolingual language models: dense, end-to-end learnable, missingness-free language embeddings are derived by computing the entropy of token-level predictive distributions over standard monolingual corpora. The approach requires no manual annotation or cross-lingual alignment and natively supports diachronic modeling. Experiments show that the resulting embeddings agree closely with authoritative typological classifications and achieve competitive performance on multilingual NLP tasks, particularly within the LinguAlchemy framework, outperforming conventional discrete-feature-based methods.
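As a rough illustration of that computation, the sketch below scores one monolingual causal LM on a text sample and averages the entropy of its next-token predictive distributions. It assumes the Hugging Face `transformers` API; `gpt2` and the sample sentence are stand-ins, not the paper's actual models or corpora.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_entropy(model, tokenizer, text: str) -> float:
    """Mean entropy (in nats) of the model's next-token predictive
    distributions over the token positions of `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits               # (1, seq_len, vocab)
    log_p = F.log_softmax(logits, dim=-1)
    # H(p) = -sum_v p(v) log p(v), computed per token position
    token_entropy = -(log_p.exp() * log_p).sum(dim=-1)  # (1, seq_len)
    return token_entropy.mean().item()

# Illustrative only: "gpt2" stands in for one monolingual (English) LM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Scoring the same model on corpora from several languages yields one
# row of an entropy matrix; each evaluated language gets one column.
print(mean_token_entropy(lm, tok, "Der Hund schläft im Garten."))
```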
📝 Abstract
We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories, which suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. We hypothesize that when a language model trained on a single language is evaluated on text from other languages, the entropy of its predictions reflects structural similarity: low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense language embeddings that are free from missing values and adaptable to different timeframes. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieve competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
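To make the resulting representation concrete, the following sketch shows how stacking mean entropies from several monolingual LMs yields a dense, missingness-free embedding per language. The entropy values, language codes, and helper names here are entirely made up for illustration and are not the paper's data or artifacts.

```python
import numpy as np

# Hypothetical entropy matrix: rows = monolingual LMs, columns = evaluated
# languages. Entry (i, j) is the mean prediction entropy of model i on a
# corpus of language j (values invented for this example).
langs = ["eng", "deu", "fra", "jpn"]
H = np.array([
    [2.1, 3.0, 3.2, 5.1],   # English LM
    [3.1, 2.0, 3.4, 5.3],   # German LM
    [3.3, 3.5, 1.9, 5.0],   # French LM
])

# Each column is a dense language embedding with no missing entries.
emb = {lang: H[:, j] for j, lang in enumerate(langs)}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Under the hypothesis, typologically close languages (eng/deu) should
# score higher than distant ones (eng/jpn).
print(cosine(emb["eng"], emb["deu"]))
print(cosine(emb["eng"], emb["jpn"]))
```

Because every model can be run over every corpus, the matrix is complete by construction, which is what distinguishes these embeddings from sparse typological inventories with unfilled feature cells.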