Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models

📅 2023-07-13
🏛️ arXiv.org

🤖 AI Summary
This work addresses two core challenges in low-resource Bengali fake news detection: modeling long texts and scarce labeled data. It proposes combining summary compression with multi-strategy data augmentation: (1) systematically comparing extractive vs. abstractive news summarization alongside back-translation and synonym replacement; and (2) introducing a three-stage pipeline (cross-lingual translation → localized augmentation → summary compression) to enhance cross-lingual generalization. Five pre-trained models, including BanglaBERT and mBERT, are evaluated with multi-stage fine-tuning on three separate test sets. BanglaBERT Base with augmentation achieves 96% accuracy on the first test set, BanglaBERT trained on summarized augmented articles reaches 97% on the second, and mBERT Base attains 86% on the third, which is reserved for evaluating generalization. All datasets and code are publicly released.
📝 Abstract
With the rise of social media and online news sources, fake news has become a significant issue globally. However, the detection of fake news in low-resource languages like Bengali has received limited attention in research. In this paper, we propose a methodology consisting of four distinct approaches to classify fake news articles in Bengali using summarization and augmentation techniques with five pre-trained language models. Our approach includes translating English news articles and using augmentation techniques to curb the deficit of fake news articles. Our research also focuses on summarizing the news to tackle the token length limitation of BERT-based models. Through extensive experimentation and rigorous evaluation, we show the effectiveness of summarization and augmentation in the case of Bengali fake news detection. We evaluated our models using three separate test datasets. The BanglaBERT Base model, when combined with augmentation techniques, achieved an accuracy of 96% on the first test dataset. On the second test dataset, the BanglaBERT model, trained with summarized augmented news articles, achieved 97% accuracy. Lastly, the mBERT Base model achieved an accuracy of 86% on the third test dataset, which was reserved for generalization performance evaluation. The datasets and implementations are available at https://github.com/arman-sakif/Bengali-Fake-News-Detection
Problem

Research questions and friction points this paper is trying to address.

Detecting fake news in Bengali, a low-resource language
Addressing data scarcity using translation and augmentation
Overcoming token limitations with summarization techniques
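
To make the token-limit point concrete, here is a minimal sketch of extractive summarization under a length budget. It is not the authors' method: the frequency-based sentence scoring, the function names, and the use of whitespace-separated words as a stand-in for subword tokens are all assumptions made for illustration. The 512 budget reflects the usual input limit of BERT-style models.

```python
import re
from collections import Counter

# Usual BERT input limit. Words are only a rough proxy for subword tokens
# (assumption: a real pipeline would count tokens with the model's tokenizer).
MAX_TOKENS = 512


def split_sentences(text):
    """Split after Latin sentence enders and the Bengali danda (।)."""
    parts = re.split(r"(?<=[.!?।])\s+", text.strip())
    return [p for p in parts if p]


def extractive_summary(text, max_tokens=MAX_TOKENS):
    """Keep the highest-scoring sentences that fit within the word budget.

    Hypothetical scoring: average corpus frequency of the words in a sentence.
    """
    sentences = split_sentences(text)
    freqs = Counter(w for s in sentences for w in s.split())

    def score(sentence):
        words = sentence.split()
        return sum(freqs[w] for w in words) / len(words)

    # Rank sentence indices by score, best first (ties keep document order).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)

    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_tokens:
            chosen.append(i)
            used += n

    # Emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

An article that already fits the budget comes back unchanged. A longer one loses its lowest-scoring sentences first, so the classifier sees a compressed version of the whole article rather than a truncated prefix.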
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmenting Bengali fake news data via translation
Summarizing articles to overcome BERT length limits
Leveraging pre-trained language models for classification
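
As an illustration of text augmentation for a scarce class, the sketch below applies synonym replacement, one of the strategies named in the summary above. It is a simplified example under stated assumptions, not the paper's implementation: the `SYNONYMS` table is a hypothetical toy dictionary, and `synonym_replace` with its `prob` and `seed` parameters is an invented helper.

```python
import random

# Hypothetical toy synonym table. A real one would come from a Bengali
# lexical resource; these entries are for illustration only.
SYNONYMS = {
    "খবর": ["সংবাদ"],
    "মিথ্যা": ["অসত্য"],
}


def synonym_replace(text, synonyms=SYNONYMS, prob=0.5, seed=None):
    """Return a copy of text with some words swapped for listed synonyms.

    Each word that has an entry in the table is replaced with probability
    `prob`; all other words are kept as they are.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for word in text.split():
        options = synonyms.get(word)
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)
```

Calling the function several times with different seeds yields several variants of one article, which is one way to enlarge an under-represented class before fine-tuning.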
Arman Sakif Chowdhury
Ahsanullah University of Science and Technology, Dhaka, Bangladesh
G. M. Shahariar
Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Ahammed Tarik Aziz
Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Syed Mohibul Alam
Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Md. Azad Sheikh
Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Tanveer Ahmed Belal
Ahsanullah University of Science and Technology, Dhaka, Bangladesh