🤖 AI Summary
High-dimensional sparsity in typological feature datasets (e.g., URIEL+) severely impairs the effectiveness of language distance metrics, especially for low-resource languages.
Method: We propose an interpretable, compact language representation framework that jointly optimizes feature selection, missing-value imputation, and dimensionality reduction—preserving typological validity while drastically compressing the feature space.
Contribution/Results: Contrary to the conventional assumption that higher dimensionality implies richer information, we show empirically that smaller, carefully selected feature subsets yield more discriminative and robust language distance estimates. Our compact representations significantly outperform the original high-dimensional features in both language distance alignment and downstream multilingual NLP tasks, including cross-lingual transfer and zero-shot parsing, delivering consistent accuracy gains alongside improved interpretability and computational efficiency.
📝 Abstract
Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that compact typological representations can yield more informative distance metrics and improve performance in multilingual NLP applications.
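The abstract describes a pipeline that combines imputation and feature selection to compress a sparse typological feature matrix before computing language distances. The paper's exact method is not given here, so the following is only an illustrative NumPy sketch under simple stand-in assumptions: mean imputation, variance-based feature selection, and cosine distance. The function `compact_representation` and the toy language-by-feature matrix are hypothetical, not the authors' implementation.

```python
import numpy as np

def compact_representation(X, k):
    """Impute missing entries with per-feature means, then keep the
    k highest-variance features (a simple stand-in for a joint
    selection/imputation pipeline)."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)            # per-feature mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]              # mean imputation of missing cells
    keep = np.argsort(X.var(axis=0))[::-1][:k]   # indices of the k most variable features
    return X[:, keep], keep

def cosine_distance(u, v):
    """1 minus cosine similarity between two feature vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy data: 4 "languages" x 5 binary typological features with gaps (NaN),
# mimicking the sparsity pattern of low-resource languages.
X = np.array([
    [1.0, 0.0, np.nan, 1.0, 0.0],
    [1.0, np.nan, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, np.nan, 1.0],
    [np.nan, 1.0, 0.0, 0.0, 1.0],
])

X_compact, kept = compact_representation(X, k=3)
print(X_compact.shape)                           # → (4, 3)
print(cosine_distance(X_compact[0], X_compact[1]))
```

In a real pipeline the imputation and selection criteria would be chosen to preserve typological validity (e.g., respecting feature dependencies), which simple mean imputation does not attempt.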