Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of labeled training data for named entity recognition (NER) in low-resource languages. We systematically evaluate the efficacy of synthetically generated data across 11 typologically diverse low-resource languages, conducting the first unified comparative study of synthetic annotations derived from mBERT and XLM-R under both supervised fine-tuning and zero-shot transfer settings. Results show that synthetic data yields average NER F1 improvements of 3.2–9.7 points; critically, these gains are modulated by language family membership and morphological complexity—demonstrating a significant role for linguistic typology in determining synthesis effectiveness. Beyond confirming the practical utility of synthetic data for low-resource NER, this study provides the first empirical characterization of the relationship between language-specific properties and synthetic data performance. It thus establishes foundational theoretical insights and methodological guidelines for typology-aware data augmentation in multilingual NLP.

📝 Abstract

Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages with limited labeled training data, and has been an area of increasing interest within NLP. Data augmentation is a common way to increase the amount of labeled data for such languages. In this paper, we explore the role of synthetic data in multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does hold promise for low-resource language NER, though we see significant variation between languages.

Problem

Research questions and friction points this paper is trying to address.

Exploring synthetic data's role in low-resource NER

Evaluating synthetic data impact across 11 diverse languages

Assessing performance variation in low-resource language NER

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic data for low-resource NER

Covers 11 languages from diverse language families

Shows that synthetic data holds promise, with significant variation between languages
