🤖 AI Summary
This work investigates whether multilingual language models (mLMs) achieve cross-lingual understanding through subword-level semantic concepts rather than surface-form similarity. To this end, we propose *semantic tokens*: subword units formed by clustering semantically similar synonyms and cross-lingual translation equivalents, enabling explicit semantic alignment; we further quantify the strength of the shared representations for these units in the embedding space. Our method provides the first systematic evaluation of how subword-level semantic sharing influences the representational capacity of multilingual encoders, and is compatible with diverse tokenizers and model scales. Evaluated across five heterogeneous multilingual downstream tasks, semantic tokens preserve, and sometimes improve, zero-shot cross-lingual transfer, achieving classification accuracy on par with or exceeding that of the original models on certain tasks. These results suggest that semantically grounded subword groupings can serve as robust anchors for cross-lingual transfer.
📝 Abstract
Human understanding of language is robust to different word choices as long as they represent similar semantic concepts. To what extent does this intuition transfer to language models, which represent all subwords as distinct embeddings? In this work, we take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs). To this end, we form "semantic tokens" by merging semantically similar subwords and their embeddings, and evaluate the updated mLMs on 5 heterogeneous multilingual downstream tasks. Results show that general shared semantics can carry the models a long way in making predictions, across mLMs with different tokenizers and model sizes. Inspection of the grouped subwords shows that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that shared subword-level semantics may serve as anchors for cross-lingual transfer.
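The merging step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the mean-pooling merge rule, and the toy clusters are all assumptions introduced here for clarity.

```python
import numpy as np

def merge_semantic_tokens(embeddings, clusters):
    """Merge semantically similar subwords into shared "semantic tokens".

    embeddings: (vocab_size, dim) array of subword embeddings.
    clusters: list of id-lists; each inner list groups subwords judged
        semantically equivalent (synonyms or cross-lingual translations).
    Returns the updated embedding matrix and an id-remapping dict that
    sends every member of a group to one canonical token id.
    """
    merged = embeddings.copy()
    remap = {}
    for group in clusters:
        # Illustrative merge rule: give every member of the group the
        # group's mean embedding, so they share one representation.
        mean_vec = merged[group].mean(axis=0)
        for tok_id in group:
            merged[tok_id] = mean_vec
            remap[tok_id] = group[0]  # canonical id for the group
    return merged, remap

# Toy example: 6 subwords with 4-dim embeddings; suppose ids 1 and 3
# were clustered together (e.g. a synonym pair across languages).
emb = np.arange(24, dtype=float).reshape(6, 4)
merged, remap = merge_semantic_tokens(emb, [[1, 3]])
# After merging, the two subwords share a single embedding.
assert np.allclose(merged[1], merged[3])
```

In a real mLM one would apply such a remapping to both the tokenizer output and the input-embedding matrix before running the downstream evaluation; the averaging rule here is only one plausible way to collapse a group into a single vector.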