🤖 AI Summary
Traditional metrics, such as character overlap or distributional similarity, fail to characterize cross-lingual knowledge transfer between language pairs whose writing systems diverge sharply.
Method: This paper introduces *token alignability*, a novel subword-level metric that quantifies how reliably subword tokens can be aligned across languages. The authors formally define and empirically validate token alignability as a core predictor of multilingual tokenization quality and cross-lingual transfer performance. The methodology integrates subword-level alignment modeling, cross-lingual embedding space analysis, controlled encoder–decoder architecture comparisons, and ablation studies that vary training data scale.
Results: Experiments demonstrate that token alignability significantly outperforms conventional metrics in predicting cross-lingual performance on non-overlapping script pairs (e.g., Chinese–English, Japanese–German). The metric also makes tokenizer design and language-pair selection more interpretable and practical. The codebase and a full reproducibility package are publicly released.
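To make the limitation concrete: a literal token-overlap metric can be sketched as a Jaccard similarity between vocabularies. The vocabularies below are toy examples (not the paper's actual tokenizers or data); they only illustrate why overlap collapses to zero for disjoint scripts even when transfer may still succeed.

```python
def token_overlap(vocab_a: set, vocab_b: set) -> float:
    """Jaccard similarity between two token vocabularies."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0

# Same script: some subwords are literally shared.
en = {"the", "house", "is", "green", "in"}
de = {"das", "haus", "ist", "gr\u00fcn", "in", "is"}

# Different scripts: no literal overlap at all.
zh = {"\u623f", "\u5b50", "\u662f", "\u7eff", "\u8272"}

print(token_overlap(en, de))  # nonzero for a shared script
print(token_overlap(en, zh))  # 0.0: the metric calls the pair
                              # "unrelated", yet transfer can work
```

Token alignability sidesteps this failure mode by asking whether subword tokens *correspond* across languages, not whether they are literally identical strings.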
📝 Abstract
Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, even though such pairs can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss how this insight may be applied to multilingual tokenisation in future work. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer, as well as for guiding the construction of better multilingual tokenisers in the future. We publish our code and reproducibility details.