🤖 AI Summary
Low accuracy in cross-domain language identification (e.g., mismatches between song titles and the user's language in music requests), the poor robustness of lightweight models (e.g., LangDetect, FastText), and the high deployment cost of large language models (LLMs) all hinder multilingual system performance. To address these challenges, this paper proposes PolyLingua, a lightweight Transformer-based language identification model. Its core innovation is a two-stage contrastive learning framework: instance-level separation enhances inter-class discriminability, while category-level alignment enforces semantic consistency across languages; an adaptive margin mechanism further improves fine-grained discrimination among typologically similar languages. Evaluated on the Amazon Massive and Song datasets, PolyLingua achieves F1 scores of 99.25% and 98.15%, respectively, surpassing Sonnet 3.5 with 10× fewer parameters. The model delivers high accuracy, low inference latency, and strong adaptability to resource-constrained environments.
📝 Abstract
Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases, such as music requests where the song title and the user's language differ. Open-source tools such as LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework that combines instance-level separation with class-level alignment under adaptive margins, yielding compact, well-separated embeddings even for closely related languages. Evaluated on two challenging datasets, Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching), PolyLingua achieves F1 scores of 99.25% and 98.15%, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.
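To make the two-level objective concrete, the following is a minimal NumPy sketch of how instance-level separation, class-level (centroid) alignment, and an adaptive margin could be combined. The function name, the `base_margin`/`scale` parameters, and the exact margin formula are hypothetical illustrations, not the paper's actual loss.

```python
import numpy as np

def two_level_contrastive_loss(emb, labels, base_margin=0.2, scale=0.5):
    """Sketch of a two-level contrastive loss with adaptive margins.

    emb:    (N, D) array of embeddings (will be L2-normalized here)
    labels: (N,) integer language labels

    Instance level: pull same-language pairs together, push different-language
    pairs apart. Class level: align each embedding with its language centroid.
    Adaptive margin (assumed formulation): language pairs whose centroids are
    close in cosine space (typologically similar) get a larger margin, forcing
    finer-grained separation.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    classes = np.unique(labels)                       # sorted unique labels
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

    sim = emb @ emb.T                                 # pairwise cosine similarity
    n = len(labels)
    inst_loss, n_pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                inst_loss += 1.0 - sim[i, j]          # pull positives together
            else:
                ci = np.searchsorted(classes, labels[i])
                cj = np.searchsorted(classes, labels[j])
                # closer centroids -> larger margin (harder negatives)
                margin = base_margin + scale * max(0.0, float(centroids[ci] @ centroids[cj]))
                inst_loss += max(0.0, sim[i, j] - (1.0 - margin))  # hinge push-apart
            n_pairs += 1
    inst_loss /= n_pairs

    # class-level alignment: each embedding vs. its own language centroid
    own = centroids[np.searchsorted(classes, labels)]
    cls_loss = float((1.0 - np.sum(emb * own, axis=1)).mean())
    return inst_loss + cls_loss
```

Under this sketch, tight and well-separated language clusters yield a low loss, while embeddings whose clusters overlap are penalized both pairwise and against their centroids, which is the intuition behind the compact, well-separated embedding space described above.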