🤖 AI Summary
This study systematically evaluates the zero-shot and fine-tuned transliteration capabilities of general-purpose large language models (LLMs) on cross-script transliteration for ten Indian languages (e.g., Hindi), benchmarking against the specialized model IndicXlit. Using the Dakshina and Aksharantar datasets, we measure performance via Top-1 accuracy and character error rate, incorporating zero-shot/few-shot prompting, supervised fine-tuning, and noise robustness analysis. Our key findings are threefold: (1) General LLMs—including GPT-4o—achieve zero-shot performance superior to IndicXlit across most languages, despite lacking domain-specific pretraining; (2) Supervised fine-tuning yields up to a 4.2% absolute accuracy gain; and (3) These models maintain >85% robustness under spelling noise—significantly outperforming IndicXlit. Collectively, results demonstrate that general LLMs possess strong zero-shot generalization for phonetic transliteration and can be efficiently adapted to specific languages via lightweight fine-tuning, establishing a new paradigm for low-resource language transliteration.
📝 Abstract
Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models (LLMs) suggest that general-purpose models may excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large, against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments utilized standard benchmarks, including the Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that GPT family models generally outperform both the other LLMs and IndicXlit in most instances. Additionally, fine-tuning GPT-4o notably improves performance on specific languages. An extensive error analysis and robustness testing under noisy conditions further elucidate the strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.
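The abstract does not spell out how the two metrics are computed; a common convention, which we assume here, defines Character Error Rate as the character-level edit distance between prediction and reference divided by the reference length, and Top-1 Accuracy as the fraction of exact-match predictions. A minimal sketch under those assumptions:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    # Character Error Rate: edit distance normalized by reference length.
    return levenshtein(prediction, reference) / max(len(reference), 1)

def top1_accuracy(predictions: list[str], references: list[str]) -> float:
    # Fraction of predictions that exactly match the gold transliteration.
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```

A perfect transliteration yields a CER of 0.0, and each substituted, inserted, or deleted character raises it proportionally, so lower is better; Top-1 Accuracy rewards only exact matches, which is why the two metrics are reported together.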