Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

📅 2026-05-02

🤖 AI Summary
This study addresses fake news detection in Bangla, a low-resource language where progress is hindered by data scarcity and severe class imbalance. It systematically investigates whether large language models can be used for data augmentation in this task, an approach that has been scarcely explored for Bangla. Using the instruction-tuned Gemma 3 27B IT model, the authors generate synthetic news articles via zero-shot and few-shot prompting, then apply semantic filtering and controlled subsampling to preserve label consistency and diversity. Augmenting only the minority class with a high augmentation rate and random subsampling gives the strongest gains, raising the F1 score on the fake news class from 0.85 to 0.88. The authors publicly release 4,545 synthetic Bangla fake news samples together with their full implementation to support further research on misinformation detection in under-resourced languages.
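To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch of how such prompts might be assembled. The function name `build_prompt` and the prompt wording are hypothetical and are not taken from the paper; the authors' actual prompts may differ substantially.

```python
def build_prompt(label, examples=()):
    """Build a generation prompt (hypothetical wording, not the paper's).

    With no examples this is a zero-shot prompt; with one or more
    examples it becomes a few-shot prompt.
    """
    lines = [f"Write one Bangla news article that belongs to the class: {label}."]
    # Few-shot: prepend real articles of the target class as demonstrations.
    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i}: {example}")
    lines.append("Article:")
    return "\n".join(lines)


zero_shot = build_prompt("fake")
few_shot = build_prompt("fake", ["<real fake article 1>", "<real fake article 2>"])
```

The resulting string would then be sent to the instruction-tuned model; the model call itself is omitted here because it depends on the serving setup.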
📝 Abstract
The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM)-based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but are highly imbalanced, which limits model performance, and LLM-based augmentation for Bangla has been scarcely explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction-tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero-shot and few-shot prompting, evaluate multiple augmentation rates, and examine random versus similarity-based selection strategies. Our experiments show that augmenting only the minority class with a high augmentation rate and random subsampling yields the strongest gains, raising the Fake News F1 score from 0.85 to 0.88. To support reproducibility and further research in this low-resource domain, we publicly release 4,545 synthetically generated Bangla fake news samples along with our full implementation. These findings demonstrate that well-designed LLM-driven augmentation can significantly improve fake news detection in low-resource settings and provide a practical foundation for advancing multilingual misinformation research.
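The filtering and subsampling steps described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the Jaccard token overlap is a toy stand-in for whatever semantic similarity measure the authors use, the thresholds `low` and `high` are hypothetical, and the augmentation rate is assumed to mean the number of added synthetic samples as a fraction of the real minority-class size.

```python
import random


def jaccard(a, b):
    """Toy lexical similarity; a stand-in for an embedding-based measure."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def semantic_filter(synthetic, seeds, low=0.1, high=0.9):
    """Keep synthetic texts related to some seed but not near-copies of it.

    `seeds` must be non-empty. Thresholds are hypothetical.
    """
    kept = []
    for text in synthetic:
        best = max(jaccard(text, seed) for seed in seeds)
        if low <= best <= high:
            kept.append(text)
    return kept


def augment_minority(real_minority, synthetic, rate, seed=0):
    """Randomly subsample synthetic texts and append them to the minority class.

    Assumes `rate` is the ratio of added samples to real minority samples.
    """
    n_add = min(len(synthetic), round(rate * len(real_minority)))
    rng = random.Random(seed)  # fixed seed for reproducible subsampling
    return list(real_minority) + rng.sample(synthetic, n_add)


seeds = ["a b c d", "e f g h"]
candidates = ["a b c d", "a b x y", "p q r s"]
filtered = semantic_filter(candidates, seeds)  # drops the exact copy and the unrelated text
augmented = augment_minority(["real 1", "real 2"], ["syn 1", "syn 2", "syn 3"], rate=1.0)
```

Only the minority (fake) class is extended, which matches the configuration the abstract reports as strongest; a similarity-based selection strategy would replace `rng.sample` with a ranking over similarity scores.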
Problem

Research questions and friction points this paper is trying to address.

Bangla
fake news detection
data scarcity
imbalanced datasets
low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based augmentation
Bangla fake news detection
synthetic data generation
semantic filtering
low-resource NLP