Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

📅 2025-10-12

🤖 AI Summary

This study investigates how shared script, overlapping token vocabularies, and shared phonology affect the performance of multilingual language models, with a focus on languages written in non-Latin scripts. We run controlled experiments with three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) alongside the original orthography, and evaluate each model on named entity recognition (NER) and natural language inference (NLI). Romanization significantly outperforms the other input types in 7 out of 8 evaluation settings, largely consistent with the hypothesis that it is the most effective approach. An analysis of the individual factors suggests that sharing longer subword tokens with the pre-trained languages leads to better utilization of the model.

📝 Abstract

Transliteration has emerged as a promising means of bridging the gap between languages in multilingual NLP, with encouraging results especially for languages that use non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributes to this success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.
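
As an illustration of why a substitution cipher is a useful control, the sketch below builds a random one-to-one character mapping and measures how much token overlap survives it. This is a minimal sketch under stated assumptions, not the paper's pipeline: the helper names (`make_cipher`, `encipher`, `vocab_overlap`), the lowercase Latin alphabet, the whitespace tokenization, and the Jaccard measure are all hypothetical choices made here for illustration. The abstract does not say how the ciphers were constructed or how overlap was measured.

```python
import random
import string


def make_cipher(alphabet: str, seed: int = 0) -> dict[str, str]:
    """Build a random one-to-one character mapping over `alphabet` (hypothetical helper)."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def encipher(text: str, cipher: dict[str, str]) -> str:
    """Replace each mapped character; leave spaces and unmapped characters unchanged."""
    return "".join(cipher.get(ch, ch) for ch in text)


def vocab_overlap(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Jaccard overlap between two token sets (an assumed stand-in for shared vocabulary)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


cipher = make_cipher(string.ascii_lowercase, seed=13)
inverse = {enc: plain for plain, enc in cipher.items()}

original = "berlin is a city"
ciphered = encipher(original, cipher)

# Whitespace tokens stand in for subword tokens in this toy comparison.
print(ciphered)
print(vocab_overlap(original.split(), ciphered.split()))
```

Because the mapping is a bijection, the ciphered text keeps word boundaries, word lengths, and character co-occurrence structure, while its surface forms typically stop matching the original vocabulary. That is what lets such a control separate the effect of shared tokens from the effect of the underlying language structure.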

Problem

Research questions and friction points this paper is trying to address.

Investigating how transliteration bridges multilingual script gaps

Evaluating romanization's effectiveness across NLP tasks

Analyzing shared vocabulary impact on model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Romanization significantly outperforms other input types in 7 out of 8 evaluation settings

Longer subword tokens shared with pre-trained languages appear to improve utilization of the model

Controlled experiments test script, vocabulary, and phonology contributions
