🤖 AI Summary
High-dimensional sparsity in typological feature datasets (e.g., URIEL+) severely impairs the effectiveness of language distance metrics, especially for low-resource languages.
Method: We propose an interpretable, compact language representation framework that jointly optimizes feature selection, missing-value imputation, and dimensionality reduction—preserving typological validity while drastically compressing the feature space.
Contribution/Results: Contrary to the conventional assumption that higher dimensionality implies richer information, we show empirically that smaller, carefully selected feature subsets yield more discriminative and robust language distance estimates. Our compact representations significantly outperform the original high-dimensional features in both language distance alignment and downstream multilingual NLP tasks, including cross-lingual transfer and zero-shot parsing, delivering consistent accuracy gains alongside improved interpretability and computational efficiency.
📝 Abstract
Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that compact typological representations can yield more informative distance metrics and improve performance in multilingual NLP applications.
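The abstract describes a pipeline that combines imputation and feature selection to compress a sparse typological feature matrix before computing language distances. The paper's exact method is not given here, so the following is only an illustrative NumPy sketch under simple stand-in assumptions: mean imputation, variance-based feature selection, and cosine distance. The function `compact_representation` and the toy language-by-feature matrix are hypothetical, not the authors' implementation.

```python
import numpy as np

def compact_representation(X, k):
    """Impute missing entries with per-feature means, then keep the
    k highest-variance features (a simple stand-in for a joint
    selection/imputation pipeline)."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)            # per-feature mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]              # mean imputation of missing cells
    keep = np.argsort(X.var(axis=0))[::-1][:k]   # indices of the k most variable features
    return X[:, keep], keep

def cosine_distance(u, v):
    """1 minus cosine similarity between two feature vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy data: 4 "languages" x 5 binary typological features with gaps (NaN),
# mimicking the sparsity pattern of low-resource languages.
X = np.array([
    [1.0, 0.0, np.nan, 1.0, 0.0],
    [1.0, np.nan, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, np.nan, 1.0],
    [np.nan, 1.0, 0.0, 0.0, 1.0],
])

X_compact, kept = compact_representation(X, k=3)
print(X_compact.shape)                           # → (4, 3)
print(cosine_distance(X_compact[0], X_compact[1]))
```

In a real pipeline the imputation and selection criteria would be chosen to preserve typological validity (e.g., respecting feature dependencies), which simple mean imputation does not attempt.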