🤖 AI Summary
This work investigates whether multilingual language models (mLMs) achieve cross-lingual understanding through subword-level semantic concepts rather than surface-form similarity. To this end, we propose *semantic tokens*: subword units formed by clustering semantically similar synonyms and cross-lingual translation equivalents, enabling explicit semantic alignment; we further quantify the strength of the shared representations for these units in the embedding space. Our method provides the first systematic evaluation of how subword-level semantic sharing influences the representational capacity of multilingual encoders, and is compatible with diverse tokenizers and model scales. Evaluated across five heterogeneous multilingual downstream tasks, semantic tokens preserve, and sometimes improve, zero-shot cross-lingual transfer, achieving classification accuracy on par with or exceeding that of the original models on certain tasks. These results suggest that semantically grounded subword groupings can serve as robust anchors for cross-lingual transfer.
📝 Abstract
Human understanding of language is robust to different word choices as long as they represent similar semantic concepts. To what extent does this intuition transfer to language models, which represent all subwords as distinct embeddings? In this work, we take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs). To this end, we form "semantic tokens" by merging semantically similar subwords and their embeddings, and evaluate the updated mLMs on 5 heterogeneous multilingual downstream tasks. Results show that general shared semantics can carry the models a long way in making predictions, across mLMs with different tokenizers and model sizes. Inspection of the grouped subwords shows that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that shared subword-level semantics may serve as anchors for cross-lingual transfer.
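The merging step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the mean-pooling merge rule, and the toy clusters are all assumptions introduced here for clarity.

```python
import numpy as np

def merge_semantic_tokens(embeddings, clusters):
    """Merge semantically similar subwords into shared "semantic tokens".

    embeddings: (vocab_size, dim) array of subword embeddings.
    clusters: list of id-lists; each inner list groups subwords judged
        semantically equivalent (synonyms or cross-lingual translations).
    Returns the updated embedding matrix and an id-remapping dict that
    sends every member of a group to one canonical token id.
    """
    merged = embeddings.copy()
    remap = {}
    for group in clusters:
        # Illustrative merge rule: give every member of the group the
        # group's mean embedding, so they share one representation.
        mean_vec = merged[group].mean(axis=0)
        for tok_id in group:
            merged[tok_id] = mean_vec
            remap[tok_id] = group[0]  # canonical id for the group
    return merged, remap

# Toy example: 6 subwords with 4-dim embeddings; suppose ids 1 and 3
# were clustered together (e.g. a synonym pair across languages).
emb = np.arange(24, dtype=float).reshape(6, 4)
merged, remap = merge_semantic_tokens(emb, [[1, 3]])
# After merging, the two subwords share a single embedding.
assert np.allclose(merged[1], merged[3])
```

In a real mLM one would apply such a remapping to both the tokenizer output and the input-embedding matrix before running the downstream evaluation; the averaging rule here is only one plausible way to collapse a group into a single vector.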