🤖 AI Summary
Current multilingual named entity recognition (NER) research is hindered by the lack of systematically constructed, reusable large-scale annotated datasets. To address this, we introduce a scalable synthetic NER dataset, the first to cover 91 languages and 25 scripts. Our method leverages FineWeb-Edu as a source corpus, trains a regression-model pre-screening mechanism to filter NER-relevant texts, employs multilingual large language models (LLMs) for automated annotation, and integrates LLM-as-a-judge for quality assessment, under which the annotations attain high faithfulness (3.99/5) and completeness (4.05/5). The dataset comprises 225K text segments and 235K entities, featuring standardized bilingual labels (source language + English), a first in multilingual NER. Using a teacher-student paradigm with cross-lingual label alignment, our approach achieves zero-shot transfer performance on English, Thai, and Swahili that matches or surpasses baselines while using only 1/19 of their training data; the regression model attains an F1 score of 84.1%. The full dataset and toolchain are open-sourced.
📝 Abstract
Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves an F1 score above 84, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages, because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target-language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective teacher-student training for multilingual NER.
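The pipeline step the abstract describes — pre-screening passages for NER relevance before sending them to an LLM annotator — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring function below is a hypothetical stand-in (density of capitalized mid-sentence tokens) for the trained regression model, and the `threshold` value is an assumption.

```python
def ner_relevance_score(passage: str) -> float:
    """Toy proxy for the regression model's NER-relevance score:
    fraction of non-sentence-initial tokens that are capitalized,
    a rough signal that a passage mentions named entities."""
    tokens = passage.split()
    if len(tokens) < 2:
        return 0.0
    caps = sum(1 for i, tok in enumerate(tokens) if i > 0 and tok[:1].isupper())
    return caps / (len(tokens) - 1)

def prescreen(passages: list[str], threshold: float = 0.15) -> list[str]:
    """Keep only passages scored as NER-relevant; only these would be
    passed on to the (more expensive) LLM annotation stage."""
    return [p for p in passages if ner_relevance_score(p) >= threshold]

corpus = [
    "Marie Curie moved to Paris to study at the Sorbonne.",
    "the quick brown fox jumps over the lazy dog again and again",
]
kept = prescreen(corpus)  # only the entity-rich first passage survives
```

In the actual pipeline a learned regressor would replace the heuristic, but the control flow — score cheaply, annotate only what passes — is the same.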