🤖 AI Summary
This work addresses the significant performance degradation of current language identification (LID) models on real-world web data, particularly for low-resource languages, and the substantial overestimation of their accuracy by prevailing evaluation protocols. To this end, the authors introduce CommonLID, a large-scale, community-curated LID benchmark covering 109 languages, explicitly designed for web text and annotated through collaborative human effort. The study systematically evaluates eight mainstream LID models on CommonLID and five widely used test sets, exposing the limitations of existing LID approaches on heterogeneous, noisy web data and revealing the biases inherent in current evaluation practices. By providing a high-quality, representative, open-source benchmark, this work establishes a more realistic foundation for future research in language identification.
📝 Abstract
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have previously been under-served, making CommonLID a key resource for developing more representative, high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.