Stemming -- The Evolution and Current State with a Focus on Bangla

📅 2025-08-21
🤖 AI Summary
Bangla, the world’s seventh most spoken language, suffers from severe data scarcity, high morphological richness, and substantial dialectal variation, all of which hinder progress in fundamental NLP tasks such as stemming. This paper systematically surveys existing stemming approaches and identifies critical gaps in reproducibility, evaluation metric relevance, and cross-dialect robustness. To address these challenges, we propose a principled development pathway for stemmers targeting low-resource, inflectional languages: integrating linguistic morphology with language-specific modeling; rigorously comparing rule-based, statistical, and machine learning methods; designing a standardized, Bangla-aware evaluation framework; and releasing an open-source, fully reproducible toolchain. Our analysis pinpoints key bottlenecks, including inadequate dialect coverage and inconsistent benchmarking, and advances both systematic research and engineering practice in Bangla NLP. The work establishes a robust foundation for downstream NLP applications, promoting scalability, interoperability, and methodological rigor in under-resourced language processing.

📝 Abstract
Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and a lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly inflectional languages like Bangla because it reduces the complexity of algorithms and models by significantly reducing the number of distinct word forms they need to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. The survey reveals a significant gap in the existing Bangla stemming literature: the paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. The paper also acknowledges the challenges posed by Bangla's rich morphology and diverse dialects. To address these challenges, it suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
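The vocabulary-reduction effect described in the abstract can be illustrated with a toy sketch. This is not the paper's method or any surveyed stemmer: the suffix list, the `strip_suffix` helper, and the `min_stem_len` guard are all hypothetical choices made for illustration, covering only a handful of common Bangla inflectional endings.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC so composed and decomposed Bangla vowel signs compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical, deliberately tiny suffix list (assumption, not from the paper).
# Sorted longest-first so that "দের" is tried before shorter endings.
SUFFIXES = sorted(
    (nfc(s) for s in ["গুলো", "দের", "রা", "টি", "কে"]),
    key=len,
    reverse=True,
)


def strip_suffix(word: str, min_stem_len: int = 2) -> str:
    """Remove the longest matching suffix, keeping at least min_stem_len code points."""
    word = nfc(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# Inflected forms of "ছেলে" (boy) and "বই" (book).
words = ["ছেলে", "ছেলেরা", "ছেলেদের", "ছেলেটি", "ছেলেকে", "বই", "বইগুলো", "বইটি"]

vocab_before = {nfc(w) for w in words}
vocab_after = {strip_suffix(w) for w in words}

print(len(vocab_before), "distinct forms before stemming")
print(len(vocab_after), "distinct stems after stemming")
```

On this small word list the distinct forms collapse to two stems. A naive rule list like this over-stems or under-stems many real words, which is the kind of weakness that relevant evaluation metrics are meant to expose.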
Problem

Research questions and friction points this paper is trying to address.

Addressing digital under-representation of Bangla language processing
Surveying stemming approaches for low-resource inflectional languages
Identifying gaps in Bangla stemming research and implementations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey of Bangla stemming approaches
Addressing morphological variant challenges
Proposing robust stemmer development directions
Abhijit Paul
Inst. of Information Technology, University of Dhaka, Bangladesh
Mashiat Amin Farin
Inst. of Information Technology, University of Dhaka, Bangladesh
Sharif Md. Abdullah
Inst. of Information Technology, University of Dhaka, Bangladesh
Ahmedul Kabir
Associate Professor, IIT, University of Dhaka
NLP, AI/ML, Health Informatics, Software Analytics
Zarif Masud
Toronto Metropolitan University, Canada
Shebuti Rayana
State University of New York, Old Westbury, USA