CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

📅 2026-01-25
🤖 AI Summary
This work addresses the marked performance degradation of current language identification (LID) models on real-world web data, particularly for low-resource languages, and the overestimation of their accuracy by prevailing evaluation protocols. To this end, the authors introduce CommonLID, a community-driven, human-annotated LID benchmark for web text covering 109 languages. The study evaluates eight popular LID models on CommonLID and five other widely used evaluation sets. The results expose the limitations of existing LID approaches on heterogeneous, noisy text and show that existing evaluations overestimate accuracy for many languages in the web domain. By releasing the benchmark and the code used to create it under an open, permissive license, the work provides a more realistic foundation for future LID research.

📝 Abstract
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
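As a rough illustration of the kind of per-language evaluation the abstract describes, the sketch below scores LID predictions language by language and contrasts overall accuracy with macro-averaged F1. The metric choice, the function names, and the toy labels are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_language_f1(gold, pred):
    """Return {language: F1} for every language that appears in the gold labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted language gains a false positive
            fn[g] += 1  # the gold language gains a false negative
    scores = {}
    for lang in set(gold):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores[lang] = 2 * tp[lang] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of per-language F1, so every language counts equally."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: three gold languages, one of them never predicted.
gold = ["eng", "eng", "bre", "bre", "hau"]
pred = ["eng", "eng", "fra", "bre", "eng"]

scores = per_language_f1(gold, pred)
print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(scores):.3f}")
for lang, f1 in sorted(scores.items()):
    print(f"{lang}: F1 = {f1:.3f}")
```

In this toy case overall accuracy is noticeably higher than macro F1, because the language that is never predicted correctly contributes a score of zero to the macro average but only one error to the accuracy. This is one way an aggregate figure can hide weak performance on individual languages.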
Problem

Research questions and friction points this paper is trying to address.

Language Identification
Web Data
Multilingual Corpora
Evaluation Benchmark
Under-served Languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

language identification
web data
benchmark dataset
low-resource languages
model evaluation
Authors

Pedro Ortiz Suarez
Principal Research Scientist, Common Crawl Foundation
Language modeling, Corpus linguistics, Named Entity Recognition, Computational Linguistics, Machine

Laurie Burchell
Common Crawl Foundation

Catherine Arnett
Researcher, EleutherAI
NLP, multilingual NLP, computational linguistics

Rafael Mosquera-Gómez
Factored AI

Sara Hincapie-Monsalve
Factored AI

Thom Vaughan
Common Crawl Foundation

Damian Stewart
Common Crawl Foundation

Malte Ostendorff
University of Göttingen / German Research Center for Artificial Intelligence
Large language models, Recommender systems, Information retrieval

Idris Abdulmumin
Postdoctoral Fellow, DSFSI, University of Pretoria
Machine Translation, Neural Machine Translation, Natural Language Processing, Internet Technology

Vukosi Marivate
University of Pretoria, Lelapa AI, Deep Learning Indaba, Masakhane Research Foundation
Data Science, Natural Language Processing, Machine Learning, Artificial Intelligence, Reinforcement

Shamsuddeen Hassan Muhammad
Bayero University, Kano, & Google DeepMind Academic Fellow at Imperial College London
Natural Language Processing, Sentiment Analysis, AfricaNLP, Low-resource NLP, Multilinguality

Atnafu Lambebo Tonja
Postdoc at MBZUAI
NLP for low-resource languages, Multilingual language models, Speech Technology

Hend Al-Khalifa
King Saud University

Nadia Ghezaiel Hammouda
University of Hail

Verrah Otiende
USIU-Africa

Tack Hwa Wong
Universiti Teknologi PETRONAS

Jakhongir Saydaliev
EPFL

Melika Nobakhtian
Tehran Institute for Advanced Studies

Muhammad Ravi Shulthan Habibi
Universitas Indonesia

Chalamalasetti Kranti
University of Potsdam

Carol Muchemi
Universität Trier

Khang Nguyen
UIT
Artificial Intelligence, Computer Vision, Timetabling

Faisal Muhammad Adam
NOUN (ACETEL)

Luis Frentzen Salim
Academia Sinica

Reem Alqifari
King Saud University
Natural Language Processing

Cynthia Amol
Maseno University

Joseph Marvin Imperial
University of Bath

Ilker Kesen
University of Copenhagen

Ahmad Mustafid
Independent

Pavel Stepachev
University of Edinburgh

Leshem Choshen
MIT, IBM AI research
Model Recycling, Evolving Collaborative Pretraining, Evaluation, Model Merging, Open the Black Box

David Anugraha
Stanford University
Machine Learning, Natural Language Processing, Multimodality, Artificial Intelligence

Hamada Nayel
Computer Science Department, Faculty of Computers and AI, Benha University
Natural Language Processing, Text Mining, Machine Learning, Biomedical Text Mining, Arabic NLP

Seid Muhie Yimam
University of Hamburg

Vallerie Alexandra Putra
Bina Nusantara University

My Chiffon Nguyen
SEACrowd

Azmine Toushik Wasi
Shahjalal University of Science and Technology
Machine Learning, AI Agents & Reasoning, Health Informatics, Graph Neural Networks, HCI-HAI & Safety

Gouthami Vadithya
University of New Haven

Rob van der Goot
IT University of Copenhagen

Lanwenn ar C’horr
Ofis Publik ar Brezhoneg

Karan Dua
Senior Applied Scientist
Computer Vision, NLP, Synthetic Data Generation, MLOps, Generative ML

Andrew Yates
Johns Hopkins University, Human Language Technology Center of Excellence
Information Retrieval, NLP, AI

Mithil Bangera
University of New Haven

Yeshil Bangera
University of New Haven
Machine Learning, Deep Learning, Data Engineering, Data Analytics

Hitesh Laxmichand Patel
Oracle
Large Language Model, Machine learning, Deep learning, Computer vision, Generative modeling

Shu Okabe
TUM Heilbronn

Fenal Ashokbhai Ilasariya
Stevens Institute of Technology

Dmitry Gaynullin
Independent

Genta Indra Winata
Capital One AI Foundations
Multilinguality, Language Modeling, Multimodal, Low-resource NLP, Code-Switching

Yiyuan Li
University of North Carolina at Chapel Hill
Natural Language Processing, Computational Linguistics

Juan Pablo Martínez
Instituto de Investigación en Ingeniería de Aragón. Universidad de Zaragoza.
Signal Processing, biomedical engineering, ECG, electrocardiology

Amit Agarwal
Liverpool John Moores University

Ikhlasul Akmal Hanif
MBZUAI

Raia Abu Ahmad
DFKI Berlin

Esther Adenuga
The African Research Collective

Filbert Aurelian Tjiaranata
Universitas Indonesia

Weerayut Buaphet
Vidyasirimedhi Institute of Science and Technology

Michael Anugraha
Independent

Sowmya Vajjala
National Research Council, Canada
Natural Language Processing

Benjamin Rice
Princeton University

Azril Hafizi Amirudin
University of The People

Jesujoba O. Alabi
Saarland University

Srikant Panda
Oracle Cloud Infrastructure
Accessibility AI, Multimodal Learning

Yassine Toughrai
LORIA

Bruhan Kyomuhendo
University of Pretoria

Daniel Ruffinelli
University of Mannheim

Akshata A
Independent

Manuel Goulão
NeuralShift

Ej Zhou
University of Cambridge

Ingrid Gabriela Franco Ramirez
Independent

Cristina Aggazzotti
Johns Hopkins University

Konstantin Dobler
Hasso Plattner Institute
Transfer Learning, Natural Language Processing

Jun Kevin
Universitas Pelita Harapan

Quentin Pagès
Independent

Nicholas Andrews
Johns Hopkins University
natural language processing, machine learning

Nuhu Ibrahim
University of Manchester

Mattes Ruckdeschel
IT University of Copenhagen

Amr Keleg
MBZUAI

Mike Zhang
Aalborg University (Copenhagen)
Artificial Intelligence, Natural Language Processing, Information Extraction, NLP Applications

Casper Muziri
University of Pretoria

Saron Samuel
Stanford University

Sotaro Takeshita
University of Mannheim

Kun Kerdthaisong
Thammasat University

Luca Foppiano
ScienciaLAB

Rasul Dent
Inria Paris

Tommaso Green
University of Mannheim

Ahmad Mustapha Wali
University of Bucharest

Kamohelo Makaaka
University of Pretoria

Vicky Feliren
Monash University, Indonesia

Inshirah Idris
Wadmedani Ahlia University

Hande Celikkanat
Common Crawl Foundation
uncertainty-aware data, uncertainty-aware evaluation, bayesian DL, fast optimized inference

Abdulhamid Abubakar
NSUK

Jean Maillard
Meta AI
Natural Language Processing, Computational Linguistics, Machine Learning, Deep Learning

Benoît Sagot
Directeur de recherches at Inria, head of the ALMAnaCH team
NLP, Language Modelling, Low-resource Languages, Machine Translation, Computational Linguistics

Thibault Clérice
Inria Paris

Kenton Murray
Research Scientist, Johns Hopkins
Machine Learning, Natural Language Processing, Machine Translation, Semantics, Neural Networks

Sarah Luger
Common Crawl Foundation