BDA: Bangla Text Data Augmentation Framework

📅 2024-12-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address data scarcity and poor generalization in low-resource Bangla text classification, this paper proposes a dual-path semantic augmentation framework that integrates pretrained language models with rule-based strategies. Its rewriting mechanism aims to preserve semantic consistency while increasing lexical diversity, combining mBERT/BanglaBERT encodings with synonym substitution and word-order perturbation, followed by a lightweight semantic similarity filter and a context-aware consistency verifier. Evaluated on five Bangla text classification datasets, the method matches the F1 score of models trained on 100% of the data while using only 50% of the original training data. It also yields notable F1 gains as the training data is progressively reduced. The work provides a systematic empirical evaluation of the effectiveness and limitations of data augmentation for low-resource Bangla text classification.
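The rule-based path and the similarity filter described in the summary can be pictured with a minimal Python sketch. This is not the authors' implementation: the function names, the toy synonym table, and the default threshold are hypothetical, and `similarity_fn` stands in for whatever sentence-similarity scorer (for example, cosine similarity over encoder embeddings) is actually used.

```python
import random
from typing import Callable, Dict, List


def synonym_substitute(tokens: List[str], synonyms: Dict[str, List[str]],
                       rng: random.Random, p: float = 0.3) -> List[str]:
    """Replace each token that has a synonym entry with probability p."""
    out = []
    for tok in tokens:
        options = synonyms.get(tok)
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out


def swap_adjacent(tokens: List[str], rng: random.Random) -> List[str]:
    """Word-order perturbation: swap one randomly chosen adjacent pair."""
    out = list(tokens)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def augment(text: str, synonyms: Dict[str, List[str]],
            similarity_fn: Callable[[str, str], float],
            n_candidates: int = 10, min_similarity: float = 0.8,
            seed: int = 0) -> List[str]:
    """Generate rule-based variants and keep those that pass the filter.

    A candidate is kept only if it differs from the original, is not a
    duplicate, and scores at least `min_similarity` (an assumed threshold)
    under the caller-supplied `similarity_fn`.
    """
    rng = random.Random(seed)
    tokens = text.split()
    kept: List[str] = []
    for _ in range(n_candidates):
        cand = " ".join(swap_adjacent(synonym_substitute(tokens, synonyms, rng), rng))
        if cand == text or cand in kept:
            continue  # adds no lexical variety
        if similarity_fn(text, cand) < min_similarity:
            continue  # meaning drifted too far from the original
        kept.append(cand)
    return kept


# Toy usage: the synonym table and the constant scorer are placeholders.
toy_synonyms = {"ভালো": ["উত্তম"], "বই": ["গ্রন্থ"]}
variants = augment("আমি ভালো বই পড়ি", toy_synonyms, lambda a, b: 1.0)
```

Passing the scorer in as a parameter keeps the filter independent of any particular encoder, so a toy scorer can be swapped for a model-based one without touching the generation code.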

📝 Abstract

Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework's effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA's performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.
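The data-scarcity experiment mentioned in the abstract (progressively reducing the training data, then augmenting it) could be set up along the following lines. This is a sketch under stated assumptions: the abstract does not give the sampling strategy or the fractions used, so the per-label (stratified) sampling, the function names, and the `augment_fn` callback are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def stratified_subsample(examples: Sequence[Example], fraction: float,
                         seed: int = 0) -> List[Example]:
    """Keep roughly `fraction` of each label's examples (at least one)."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset: List[Example] = []
    for label in sorted(by_label):  # sorted for reproducibility
        items = by_label[label]
        k = max(1, round(len(items) * fraction))
        subset.extend(rng.sample(items, k))
    return subset


def build_training_sets(examples: Sequence[Example], fractions: Sequence[float],
                        augment_fn: Callable[[str], List[str]],
                        seed: int = 0) -> Dict[float, List[Example]]:
    """For each fraction, subsample the data and append augmented copies.

    `augment_fn` is a placeholder for the augmentation step; each variant
    it returns inherits the label of the text it was generated from.
    """
    result: Dict[float, List[Example]] = {}
    for frac in fractions:
        subset = stratified_subsample(examples, frac, seed)
        augmented = [(new_text, label)
                     for text, label in subset
                     for new_text in augment_fn(text)]
        result[frac] = subset + augmented
    return result
```

A classifier would then be trained on each returned set and its F1 compared against a model trained on the same subset without augmentation.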

Problem

Research questions and friction points this paper is trying to address.

Bangla language

text classification

model generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

BDA Framework

Bangla Language Processing

Data Augmentation for Text Classification
