BDA: Bangla Text Data Augmentation Framework

📅 2024-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address data scarcity and poor generalization in low-resource Bangla text classification, this paper proposes a dual-path semantic augmentation framework integrating pretrained language models with rule-based strategies. The framework introduces a novel rewriting mechanism that jointly preserves semantic consistency and lexical diversity, combining mBERT/BanglaBERT encodings with synonym substitution and word-order perturbation, augmented by a lightweight semantic similarity filter and a context-aware consistency verifier. Experimental evaluation across five standard Bangla text classification benchmarks demonstrates that the method achieves peak performance—matching the full-data F1 score of 100%—using only 50% of the original training data. Moreover, it maintains robust performance gains under increasingly sparse data conditions. This work provides the first systematic empirical validation of high-quality data augmentation’s effectiveness and feasibility for low-resource Bangla text classification.

Technology Category

Application Category

📝 Abstract
Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework's effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA's performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.
Problem

Research questions and friction points this paper is trying to address.

Bangla language
text classification
model generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

BDA Framework
Bangla Language Processing
Data Augmentation for Text Classification
🔎 Similar Papers
No similar papers found.
M
Md. Tariquzzaman
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, K B Bazar Road, Gazipur, 1704, Bangladesh
A
A. Anam
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, K B Bazar Road, Gazipur, 1704, Bangladesh
N
Naimul Haque
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, K B Bazar Road, Gazipur, 1704, Bangladesh
Mohsinul Kabir
Mohsinul Kabir
PhD Candidate at the University of Manchester
NLPHCIAI
Hasan Mahmud
Hasan Mahmud
Postdoctoral Research Associate, Rochester Institute of Technology
Information SystemsAlgorithmic decision-makingHCI/Human-AI interaction
M
Md. Kamrul Hasan
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, K B Bazar Road, Gazipur, 1704, Bangladesh