🤖 AI Summary
This study investigates how large language models (LLMs), trained predominantly on English, generalize to non-Latin-script languages, examining in particular their implicit reliance on latent romanization. Methodologically, the authors employ mechanistic interpretability techniques, including activation patching and layer-wise tracing of next-token generation. The key contribution is the first empirical demonstration that (1) LLMs internally map non-Latin tokens to romanized forms in intermediate layers before producing them in native script, (2) semantic representations of native and romanized text are highly aligned, and (3) in translation tasks, romanized target representations emerge 2–4 layers earlier than native-script ones. These findings reveal a cross-script pathway for semantic sharing, identify and validate the phenomenon of latent romanization, and provide interpretable evidence for understanding multilingual generalization in LLMs, highlighting a mechanism of cross-lingual transfer that operates beyond explicit tokenization.
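Activation patching, as referenced above, swaps an intermediate activation from one forward pass into another and checks whether the output follows the patched activation. A minimal toy sketch of the mechanic (the two-layer linear "model", its weights, and the inputs are all synthetic stand-ins, not the paper's actual setup):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Toy two-layer "model": hidden = W1 @ x ; logits = W2 @ hidden.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(3, d))

def run(x, patch_hidden=None):
    """Forward pass; optionally overwrite the hidden state (activation patching)."""
    h = W1 @ x
    if patch_hidden is not None:
        h = patch_hidden
    return W2 @ h, h

x_native = rng.normal(size=d)  # stand-in for a native-script input
x_roman = rng.normal(size=d)   # stand-in for its romanized counterpart

logits_native, h_native = run(x_native)
logits_roman, h_roman = run(x_roman)

# Patch the romanized run's hidden state into the native-script run:
logits_patched, _ = run(x_native, patch_hidden=h_roman)

# In this toy model the hidden state fully determines downstream behavior,
# so the patched output matches the romanized run exactly.
print(np.allclose(logits_patched, logits_roman))  # True
```

In the paper's setting the same logic applies to transformer residual streams: if patching an activation from a romanized-input run into a native-script run steers the output, the two scripts share representations at that layer.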
📝 Abstract
Large Language Models (LLMs) exhibit remarkable multilingual generalization despite being trained predominantly on English-centric corpora. A fundamental question arises: how do LLMs achieve such robust multilingual capabilities? For non-Latin-script languages, we investigate the role of romanization (the representation of non-Latin scripts using Latin characters) as a bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and romanized scripts, suggesting a shared underlying representation. Additionally, in translation into non-Latin-script languages, we find that when the target language is in romanized form, its representations emerge in earlier layers than they do for native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of romanization in facilitating language transfer. Our work suggests new directions for improving multilingual language modeling and interpretability.
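The layer-wise analysis of next-token generation described above is in the spirit of logit-lens-style tracing: decode each layer's hidden state through the unembedding matrix and inspect which token it currently favors. A minimal sketch with synthetic weights and hidden states (the vocabulary, dimensions, and drift pattern are illustrative assumptions, not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: a romanized form, its native-script form, and filler tokens.
vocab = ["namaste", "नमस्ते", "the", "cat"]
d_model, n_layers = 64, 6

# Unembedding matrix maps hidden states to vocabulary logits.
W_U = rng.normal(size=(d_model, len(vocab)))

# Synthetic hidden states that mimic the reported pattern: middle layers
# align with the romanized token's direction, final layers with the
# native-script token's direction.
roman_dir = W_U[:, vocab.index("namaste")]
native_dir = W_U[:, vocab.index("नमस्ते")]
hidden = []
for layer in range(n_layers):
    if layer < 2:
        h = rng.normal(size=d_model)                     # early: no clear preference
    elif layer < 4:
        h = roman_dir + 0.1 * rng.normal(size=d_model)   # middle: romanized form
    else:
        h = native_dir + 0.1 * rng.normal(size=d_model)  # late: native script
    hidden.append(h)

def logit_lens(hidden_states, W_U, vocab):
    """Decode each layer's hidden state to its most likely token."""
    return [vocab[int(np.argmax(h @ W_U))] for h in hidden_states]

trace = logit_lens(hidden, W_U, vocab)
print(trace)  # middle layers read "namaste", final layers "नमस्ते"
```

Applied to a real transformer, the same per-layer decoding is what would surface an intermediate romanized representation before the native-script token appears at the output.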