Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

📅 2025-10-31
🤖 AI Summary
URIEL+ suffers from data sparsity, including missing feature types, incomplete language entries, and limited genealogical coverage, which hinders its usefulness for cross-lingual transfer to low-resource languages. To address this, the paper extends URIEL+ in three ways: (1) adding script vectors that represent writing system properties for 7,488 languages; (2) integrating Glottolog to add 18,710 languages; and (3) expanding lineage imputation to 26,449 languages by propagating typological and script features across genealogies. The reported results: feature sparsity for script vectors decreases by 14%; language coverage increases by up to 19,015 languages (1,007%); imputation quality metrics improve by up to 33%; and cross-lingual transfer performance, which occasionally diverges from URIEL+, improves by up to 6% in certain setups.

📝 Abstract
The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.
Problem

Research questions and friction points this paper is trying to address.

Addressing data sparsity in URIEL+ linguistic knowledge base
Expanding language coverage and feature representation for low-resource languages
Improving cross-lingual transfer through enhanced genealogical and typological imputation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Added script vectors for 7,488 languages
Integrated Glottolog to add 18,710 languages
Expanded lineage imputation for 26,449 languages
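The lineage imputation idea listed above (propagating features across genealogies) can be illustrated with a minimal sketch. This is not the URIEL+ implementation or API: the function `impute_from_lineage`, the `parent` map, and the toy language names are all hypothetical, and the rule used here (copy the vector of the nearest ancestor that has one) is an assumed simplification of whatever propagation scheme the paper actually uses.

```python
from typing import Dict, List, Optional


def impute_from_lineage(
    parent: Dict[str, Optional[str]],
    features: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Fill missing feature vectors from the nearest ancestor that has one.

    `parent` maps each node of a genealogy to its parent (None for a root).
    `features` holds known vectors; nodes absent from it are candidates
    for imputation. Hypothetical helper, for illustration only.
    """
    imputed = dict(features)
    for lang in parent:
        if lang in features:
            continue  # observed values are never overwritten
        # Walk up the tree until an ancestor with a known vector is found.
        ancestor = parent[lang]
        while ancestor is not None and ancestor not in features:
            ancestor = parent.get(ancestor)
        if ancestor is not None:
            imputed[lang] = list(features[ancestor])
    return imputed


# Toy genealogy (made-up names, not real Glottolog entries).
parent = {
    "family": None,
    "branch_a": "family",
    "lang_1": "branch_a",
    "lang_2": "branch_a",
    "isolate": None,
}
features = {
    "branch_a": [1.0, 0.0],
    "lang_1": [1.0, 1.0],
}

result = impute_from_lineage(parent, features)
print(result["lang_2"])     # inherited from branch_a
print("isolate" in result)  # no ancestor with data, so it stays missing
```

In this sketch a language with no documented ancestor stays missing rather than receiving a guessed value, which keeps imputed entries traceable to a specific node in the tree.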
Mason Shipton
Ontario Tech University
York Hay Ng
University of Toronto
Aditya Khan
University of Toronto
Phuong Hanh Hoang
University of Toronto
Xiang Lu
Associate Professor, Institute of Information Engineering, Chinese Academy of Sciences
cyber security, cyber-physical system, wireless network security
A. Seza Doğruöz
LT3, IDLab, Universiteit Gent
En-Shiun Annie Lee
Ontario Tech University, and University of Toronto (Status-Only)
Natural Language Processing, Data Mining, Pattern Analysis