🤖 AI Summary
Traditional typological inventories suffer from feature sparsity and static rigidity, failing to capture the continuous, dynamic nature of cross-linguistic typological relationships. To address this, we propose a novel paradigm for language representation grounded in the prediction entropy of monolingual language models: dense, end-to-end learnable, missingness-free language embeddings are derived by computing the entropy of token-level predictive distributions over standard monolingual corpora. The approach requires no manual annotation or cross-lingual alignment and natively supports diachronic modeling. Experiments show that the resulting embeddings agree closely with authoritative typological classifications and achieve competitive performance on multilingual NLP tasks, particularly within the LinguAlchemy framework, outperforming conventional discrete-feature-based methods.
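As a rough illustration of that computation, the sketch below scores one monolingual causal LM on a text sample and averages the entropy of its next-token predictive distributions. It assumes the Hugging Face `transformers` API; `gpt2` and the sample sentence are stand-ins, not the paper's actual models or corpora.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_entropy(model, tokenizer, text: str) -> float:
    """Mean entropy (in nats) of the model's next-token predictive
    distributions over the token positions of `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits               # (1, seq_len, vocab)
    log_p = F.log_softmax(logits, dim=-1)
    # H(p) = -sum_v p(v) log p(v), computed per token position
    token_entropy = -(log_p.exp() * log_p).sum(dim=-1)  # (1, seq_len)
    return token_entropy.mean().item()

# Illustrative only: "gpt2" stands in for one monolingual (English) LM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Scoring the same model on corpora from several languages yields one
# row of an entropy matrix; each evaluated language gets one column.
print(mean_token_entropy(lm, tok, "Der Hund schläft im Garten."))
```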
📝 Abstract
We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories, which suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. We hypothesize that when a language model trained on a single language is evaluated on text from other languages, the entropy of its predictions reflects structural similarity: low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense language embeddings that are free from missing values and adaptable to different timeframes. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieve competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
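To make the resulting representation concrete, the following sketch shows how stacking mean entropies from several monolingual LMs yields a dense, missingness-free embedding per language. The entropy values, language codes, and helper names here are entirely made up for illustration and are not the paper's data or artifacts.

```python
import numpy as np

# Hypothetical entropy matrix: rows = monolingual LMs, columns = evaluated
# languages. Entry (i, j) is the mean prediction entropy of model i on a
# corpus of language j (values invented for this example).
langs = ["eng", "deu", "fra", "jpn"]
H = np.array([
    [2.1, 3.0, 3.2, 5.1],   # English LM
    [3.1, 2.0, 3.4, 5.3],   # German LM
    [3.3, 3.5, 1.9, 5.0],   # French LM
])

# Each column is a dense language embedding with no missing entries.
emb = {lang: H[:, j] for j, lang in enumerate(langs)}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Under the hypothesis, typologically close languages (eng/deu) should
# score higher than distant ones (eng/jpn).
print(cosine(emb["eng"], emb["deu"]))
print(cosine(emb["eng"], emb["jpn"]))
```

Because every model can be run over every corpus, the matrix is complete by construction, which is what distinguishes these embeddings from sparse typological inventories with unfilled feature cells.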