Lost in Transliteration: Bridging the Script Gap in Neural IR

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the “script gap” problem: multilingual retrieval performance degrades sharply when users of non-Latin-script languages issue transliterated, Latinized queries (e.g., Greeklish, Arabizi), even with strong multilingual dense embedding models such as BGE-M3. The authors adapt the “translate-train” paradigm to transliteration: existing multilingual dense retrievers are further fine-tuned on an even mixture of native-script and Latinized query text, with no transliteration dictionaries or external translation modules required. Experiments show that this mixed-script fine-tuning lets models perform cross-script matching at nearly the same level as when queries are written in the native script, substantially closing the gap. Out-of-domain evaluation and qualitative analysis also reveal that transliteration can strip queries of some of their nuance, motivating further research in this direction.

📝 Abstract
Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated -- usually Latinized -- form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such as BGE-M3, do not generalise to this setting, and their performance rapidly deteriorates when exposed to transliterated queries. This creates a "script gap" between the performance of the same queries when written in their native or transliterated form. We explore whether adapting the popular "translate-train" paradigm to transliterations can enhance the robustness of multilingual Information Retrieval (IR) methods and bridge the gap between native and transliterated scripts. By exploring various combinations of non-Latin and Latinized query text for training, we investigate whether we can enhance the capacity of existing neural retrieval techniques and enable them to apply to this important setting. We show that by further fine-tuning IR models on an even mixture of native and Latinized text, they can perform this cross-script matching at nearly the same performance as when the query was formulated in the native script. Out-of-domain evaluation and further qualitative analysis show that transliterations can also cause queries to lose some of their nuances, motivating further research in this direction.
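The key data recipe in the abstract is training on an even mixture of native-script and Latinized query text. The sketch below illustrates one way such a mixture could be constructed, using a toy character-level Greeklish mapping; both the mapping and the function names are hypothetical illustrations, not the paper's actual transliteration scheme or code.

```python
# Hedged sketch: build a 50/50 native/Latinized training mixture.
# GREEKLISH is a toy, incomplete character map for illustration only;
# real Greeklish is informal and has many competing conventions.
GREEKLISH = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e",
    "η": "i", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y",
}

def latinize(text: str) -> str:
    """Character-level Latinization; unmapped characters pass through."""
    return "".join(GREEKLISH.get(ch, ch) for ch in text)

def mixed_training_pairs(pairs):
    """Alternate native and Latinized query forms so the fine-tuning
    corpus is an even script mixture, as the paper trains on."""
    mixed = []
    for i, (query, doc) in enumerate(pairs):
        q = query if i % 2 == 0 else latinize(query)
        mixed.append((q, doc))
    return mixed
```

Deterministic alternation is used here for clarity; sampling each example's script at random with probability 0.5 would achieve the same even mixture in expectation.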
Problem

Research questions and friction points this paper is trying to address.

Search systems fail with transliterated non-Latin queries
Script gap exists between native and transliterated query performance
Adapting translate-train for transliterations may improve IR robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapt translate-train paradigm for transliterations
Fine-tune IR models on mixed native and Latinized text
Enhance cross-script matching performance in IR
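Dense retrievers like BGE-M3 are typically fine-tuned with an in-batch contrastive objective, where each query's gold document is the positive and the other documents in the batch serve as negatives. The paper does not spell out its exact loss, so the following is a minimal NumPy sketch of a standard InfoNCE-style objective of that kind, shown only to make the fine-tuning step concrete.

```python
import numpy as np

def info_nce_loss(q_emb, d_emb, temperature=0.05):
    """In-batch contrastive loss over a batch of B (query, document)
    pairs: row i of q_emb should match row i of d_emb, and all other
    rows of d_emb act as negatives. Under mixed-script training, q_emb
    holds both native and Latinized queries, pushing both forms toward
    the same document embedding."""
    # Cosine similarity via L2-normalized embeddings.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature               # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))
```

When query and document embeddings are aligned (each query closest to its own document), this loss is near zero; misaligned batches are penalized, which is what drives cross-script query forms toward their shared documents during fine-tuning.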