Majority or Minority: Data Imbalance Learning Method for Named Entity Recognition

📅 2024-01-21
🏛️ IEEE Access
📈 Citations: 0
Influential: 0
🤖 AI Summary
In named entity recognition (NER), severe class imbalance arises from the dominance of the “O” (non-entity) class and the long-tailed distribution of entity classes, hindering model generalization to rare entities. To address this, we propose the “Majority-or-Minority” (MoM) learning paradigm, featuring a lightweight loss correction mechanism: it explicitly incorporates only the loss from majority-class (“O”) samples—without resampling, loss reweighting, auxiliary modules, additional parameters, or preprocessing—and remains fully compatible with standard architectures (e.g., BERT, BiLSTM) and frameworks (sequence labeling, machine reading comprehension). Evaluated on four large-scale Japanese and English NER benchmarks, MoM consistently improves minority-class F1 by 2.1–4.7 points on average, without degrading “O”-class performance. It demonstrates robust cross-lingual, cross-architectural, cross-framework, and cross-scale generalization, outperforming existing state-of-the-art methods for imbalanced NER.

📝 Abstract
Data imbalance presents a significant challenge in various machine learning (ML) tasks, particularly named entity recognition (NER) within natural language processing (NLP). NER exhibits a data imbalance with a long-tail distribution, featuring numerous minority classes (i.e., entity classes) and a single majority class (i.e., the “O” class). This imbalance leads to misclassifications of the entity classes as the “O” class. To tackle this issue, we propose a simple and effective learning method named majority or minority (MoM) learning. MoM learning incorporates the loss computed only for samples whose ground truth is the majority class into the loss of the conventional ML model. Evaluation experiments on four NER datasets (Japanese and English) showed that MoM learning improves the prediction performance of the minority classes without sacrificing the performance of the majority class and is more effective than widely known and state-of-the-art methods. We also evaluated MoM learning using frameworks such as sequence labeling and machine reading comprehension, which are commonly used in NER. Furthermore, MoM learning achieves consistent performance improvements regardless of language, model, framework, or data size.
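The core idea — adding a loss term computed only over samples whose gold label is the majority “O” class — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the unweighted sum of the two terms, and the use of class index 0 for “O” are all assumptions made here for clarity.

```python
import numpy as np

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the gold labels under the
    # predicted class-probability matrix (n_samples x n_classes).
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def mom_loss(probs, labels, o_class=0):
    """Sketch of Majority-or-Minority (MoM) learning.

    Standard cross-entropy over all tokens, plus a second cross-entropy
    term computed only on tokens whose ground truth is the majority
    "O" class (here assumed to be class index 0).
    """
    base = cross_entropy(probs, labels)
    o_mask = labels == o_class
    if not o_mask.any():
        # No majority-class tokens in this batch: fall back to plain CE.
        return base
    majority = cross_entropy(probs[o_mask], labels[o_mask])
    # How the two terms are combined (e.g., any weighting) is an
    # assumption; the paper's exact formulation may differ.
    return base + majority
```

Because the extra term penalizes only errors on gold-“O” tokens, it discourages the model from absorbing entity tokens into the “O” class without adding parameters, resampling, or reweighting.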
Problem

Research questions and friction points this paper is trying to address.

Named Entity Recognition
Data Imbalance
Rare Name Identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Majority or Minority Learning (MoM)
Imbalanced Data Problem
Named Entity Recognition