RomanLens: Latent Romanization and its role in Multilinguality in LLMs

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how large language models (LLMs), trained predominantly on English, generalize to non-Latin-script languages, specifically examining their implicit reliance on "latent romanization." Methodologically, the authors employ mechanistic interpretability techniques, including activation patching and layer-wise tracing of next-token generation. The key contribution is the first empirical demonstration that LLMs internally map non-Latin tokens to romanized forms in intermediate layers before reconstructing them into native script; that semantic representations of native and romanized scripts are highly aligned; and that in translation tasks, romanized-target representations emerge 2-4 layers earlier than native-script ones. These findings reveal a cross-script semantic sharing pathway, identify and validate the phenomenon of implicit romanization, and provide interpretable evidence for understanding multilingual generalization in LLMs, highlighting a mechanism underlying cross-lingual transfer that operates beyond explicit tokenization.
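The layer-wise tracing described above can be illustrated with a toy "logit lens"-style readout: project each layer's hidden state through the unembedding matrix and see which vocabulary item the model is leaning toward at that depth. The vocabulary, dimensions, and weights below are synthetic stand-ins, not the paper's actual setup; in the real experiments these states come from an LLM's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["नमस्ते", "namaste", "hello"]  # native script, romanized form, English
d_model = 8

# Hypothetical unembedding matrix: one column per vocabulary item.
# Orthonormal columns keep this toy example deterministic and readable.
W_U = np.eye(d_model)[:, :len(vocab)]

def logit_lens(hidden_states, W_U, vocab):
    """Decode each layer's hidden state into its most likely token."""
    readout = []
    for h in hidden_states:
        logits = h @ W_U                      # project onto vocabulary
        readout.append(vocab[int(np.argmax(logits))])
    return readout

# Simulated trajectory mirroring the paper's claim: a middle layer aligns
# with the romanized token before the final layer settles on native script.
states = [
    rng.normal(size=d_model),                        # early layer: noisy
    W_U[:, 1] + 0.1 * rng.normal(size=d_model),      # mid layer: "namaste"
    W_U[:, 0] + 0.1 * rng.normal(size=d_model),      # late layer: "नमस्ते"
]
print(logit_lens(states, W_U, vocab))
```

The readout over the last two states shows the romanized form preceding the native-script form across depth, which is the signature the paper terms Latent Romanization.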

📝 Abstract
Large Language Models (LLMs) exhibit remarkable multilingual generalization despite being predominantly trained on English-centric corpora. A fundamental question arises: how do LLMs achieve such robust multilingual capabilities? For non-Latin-script languages, we investigate the role of romanization (the representation of non-Latin scripts using Latin characters) as a bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and romanized scripts, suggesting a shared underlying representation. Additionally, in translation into non-Latin-script languages, our findings reveal that when the target language is in romanized form, its representations emerge earlier in the model's layers than they do for native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of romanization in facilitating language transfer. Our work provides new directions for potentially improving multilingual language modeling and interpretability.
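The activation patching mentioned in the abstract can be sketched on a toy two-layer network: cache an internal activation from a "source" run (romanized input) and splice it into a "target" run (native-script input), then check how the output changes. Everything here is a synthetic stand-in; the actual experiments patch residual-stream activations inside an LLM.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W1 = rng.normal(size=(d, d))   # toy layer-1 weights
W2 = rng.normal(size=(d, d))   # toy layer-2 weights

def forward(x, patch=None, layer=None):
    """Run the toy model, optionally overwriting one layer's activation."""
    h = np.tanh(x @ W1)        # layer-1 activation
    if layer == 1 and patch is not None:
        h = patch              # activation patching: swap in a cached state
    return h @ W2              # layer-2 output (stand-in for logits)

x_native = rng.normal(size=d)  # stand-in for a native-script prompt
x_roman = rng.normal(size=d)   # stand-in for its romanized counterpart

# Cache the romanized run's layer-1 activation...
h_roman = np.tanh(x_roman @ W1)
# ...and patch it into the native-script run.
patched = forward(x_native, patch=h_roman, layer=1)
clean_roman = forward(x_roman)

print(np.allclose(patched, clean_roman))  # → True: the patch carries the behavior
```

Here the patched output matches the romanized run exactly because the entire layer activation was replaced; in the paper's setting, the analogous finding is that swapping representations across scripts preserves the semantic content, evidence of a shared underlying representation.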
Problem

Research questions and friction points this paper is trying to address.

Investigate role of romanization in multilingual LLMs
Analyze latent romanization in token generation
Explore shared semantic representations across scripts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Romanization facilitates multilingual processing
Shared semantic encoding across scripts
Romanized form accelerates representation in layers
Alan Saji
Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India
Jaavid Aktar Husain
Singapore University of Technology and Design
Thanmay Jayakumar
Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India
Raj Dabre
Researcher@NICT (Japan), Adjunct Faculty@IIT Madras/AI4Bharat (India)
Artificial Intelligence, Machine Translation, Natural Language Processing, Genetics
Anoop Kunchukuttan
Microsoft Translator, AI4Bharat
NLP, Multilingual Learning, Instruction Tuning, MT, Indian language NLP
Mitesh M. Khapra
Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India
Ratish Puduppully
IT University of Copenhagen
Natural Language Processing