🤖 AI Summary
Language identification for languages natively written in non-Latin scripts (e.g., Hindi and Urdu) is severely hindered by the high spelling variability of informal romanized text. Method: The paper proposes synthesizing training data that models natural spelling variation, combining rule-based and statistical strategies to emulate authentic variation patterns without requiring annotated real-world text; the synthetic data is used to train a lightweight linear classifier, evaluated on the Bhasha-Abhijnaanam benchmark. Contribution/Results: The paper provides strong empirical evidence that high-quality synthetic data alone, without larger models or real annotated data, can substantially advance romanized language identification. On a benchmark covering 20 Indic languages, the method reaches 88.2% test F1 when synthetic data is combined with available harvested text, outperforming the prior state of the art (74.7%) by 13.5 percentage points; notably, training exclusively on synthetic data already yields 85.4% F1.
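The rule-based side of the synthetic-variation idea can be sketched as follows. The substitution rules, probabilities, and function names below are illustrative assumptions, not the paper's actual rules; they mimic common interchangeable spellings seen in informal romanized Hindi/Urdu.

```python
import random

# Illustrative rule set (an assumption, not the paper's actual rules):
# common interchangeable spellings in informal romanized Hindi/Urdu.
SUBSTITUTIONS = {
    "aa": ["a"],   # "naam"    ~ "nam"
    "ee": ["i"],   # "theek"   ~ "thik"
    "oo": ["u"],   # "poora"   ~ "pura"
    "w":  ["v"],   # "wala"    ~ "vala"
    "q":  ["k"],   # "qismat"  ~ "kismat"
    "z":  ["j"],   # "zaroor"  ~ "jaroor"
}

def spelling_variants(text, rng=random, p=0.5):
    """Return one synthetic variant of `text`, applying each rule with prob. p."""
    out = text
    for src, alts in SUBSTITUTIONS.items():
        if src in out and rng.random() < p:
            out = out.replace(src, rng.choice(alts))
    return out

rng = random.Random(0)
print(spelling_variants("zaroor theek", rng, p=1.0))  # prints "jarur thik"
```

Sampling this function repeatedly over a seed corpus yields many plausible spellings per word, which is the kind of variation a classifier must tolerate at test time.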
📝 Abstract
The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), there is no conventional spelling of words in the Latin script, hence there will be high spelling variability in written text. Such romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
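To make the "linear classifier" concrete, here is a minimal sketch of a character n-gram model for LID, using a naive Bayes formulation (linear in log space). This is an illustrative stand-in, not the paper's implementation; the training snippets and labels are made up for demonstration, not drawn from the paper's data.

```python
from collections import Counter, defaultdict
import math

def char_ngrams(text, n=3):
    """Character trigrams with boundary padding, a common feature set for LID."""
    padded = f"  {text.lower()}  "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramNB:
    """Multinomial naive Bayes over char n-grams (a linear model in log space)."""
    def __init__(self, n=3, alpha=1.0):
        self.n, self.alpha = n, alpha
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-grams seen
        self.vocab = set()

    def fit(self, texts, labels):
        for text, y in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[y].update(grams)
            self.totals[y] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        v = len(self.vocab)
        def score(y):  # sum of add-alpha-smoothed log-likelihoods
            return sum(math.log((self.counts[y][g] + self.alpha)
                                / (self.totals[y] + self.alpha * v))
                       for g in grams)
        return max(self.counts, key=score)

# Toy training data (illustrative romanized snippets, not the paper's data).
train = [("main theek hoon", "hin"), ("aap kaise ho", "hin"),
         ("mujhe maloom nahi", "urd"), ("shukriya janab", "urd")]
clf = NgramNB().fit([t for t, _ in train], [y for _, y in train])
print(clf.predict("aap theek ho"))  # prints "hin"
```

In the paper's setting, a model of this general family would be trained on synthetic romanized text generated per language, so that the n-gram statistics absorb the spelling variability described above.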