Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

📅 2025-04-14
🤖 AI Summary
This paper addresses three challenges in banking reputation monitoring for low-resource, code-mixed Sinhala-English text: keyword extraction, irrelevant comment filtering, and aspect classification. To this end, we propose a hybrid, domain-adapted NLP framework. Methodologically: (1) we fine-tune an XLM-RoBERTa model and integrate it with a domain-specific Sinhala financial vocabulary for Sinhala and code-mixed keyword extraction; (2) we design a hybrid English keyword extractor that combines a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank; and (3) we fine-tune Transformer classifiers (BERT-base-uncased for English, XLM-RoBERTa for Sinhala) for irrelevant comment filtering and aspect classification. Reported results are 91.2% keyword extraction accuracy for English and 87.4% for Sinhala and code-mixed text; 85.2% (English) and 88.1% (Sinhala) for irrelevant comment filtering; and 87.4% (English) and 85.9% (Sinhala) for aspect classification. The fine-tuned models outperform GPT-4o, SVM, and keyword-based filtering on the filtering task, and GPT-4 and keyword-based approaches on aspect classification.

📝 Abstract
Brand reputation in the banking sector is maintained through insightful analysis of customer opinion in code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, particularly when it involves low-resource language pairs such as Sinhala-English, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, yielding an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa achieving 88.1% for Sinhala, both outperforming GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa achieving 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
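
The abstract names four English keyword extractors but does not say how their outputs are combined. As one possible illustration, the sketch below merges ranked keyword lists with reciprocal rank fusion (RRF). RRF is a technique swapped in here for illustration and is not confirmed to be the paper's method; the function name `fuse_keywords`, the constant `k=60`, and the sample extractor outputs are all assumptions.

```python
from collections import defaultdict


def fuse_keywords(ranked_lists, top_k=5, k=60):
    """Merge ranked keyword lists using reciprocal rank fusion.

    Each keyword receives 1 / (k + rank) from every list it appears in,
    so keywords ranked highly by several extractors rise to the top.
    k=60 is a commonly used smoothing constant, not a value from the paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        seen = set()
        for rank, keyword in enumerate(ranking, start=1):
            key = keyword.strip().lower()  # normalise case and whitespace
            if not key or key in seen:     # count each keyword once per list
                continue
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ordered[:top_k]]


# Hypothetical ranked outputs standing in for the four extractors.
ner_keywords = ["fixed deposit", "interest rate"]
keybert_keywords = ["interest rate", "fixed deposit", "mobile app"]
yake_keywords = ["interest rate", "branch", "fixed deposit"]
embedrank_keywords = ["mobile app", "interest rate"]

fused = fuse_keywords(
    [ner_keywords, keybert_keywords, yake_keywords, embedrank_keywords],
    top_k=3,
)
print(fused)
```

A rank-based fusion avoids having to calibrate the extractors' raw scores against each other, which matters because NER confidences, YAKE scores, and embedding similarities are not on a shared scale.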
Problem

Research questions and friction points this paper is trying to address.

Improves keyword extraction in Sinhala-English code-mixed content
Enhances aspect classification for multilingual banking texts
Filters irrelevant comments in low-resource language datasets
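
The framework uses separate models for English and for Sinhala or code-mixed text, so incoming comments have to be routed somehow. The source does not describe that step; the sketch below is one simple assumption, based only on the fact that Sinhala script occupies the Unicode block U+0D80 to U+0DFF. The function name `detect_script_mix` and the sample strings are illustrative.

```python
import re

# Sinhala script occupies the Unicode block U+0D80 to U+0DFF.
SINHALA_CHARS = re.compile(r"[\u0D80-\u0DFF]")
LATIN_CHARS = re.compile(r"[A-Za-z]")


def detect_script_mix(text):
    """Label text by the scripts it contains.

    Returns "code-mixed" when both Sinhala and Latin letters occur,
    "sinhala" or "english" when only one script occurs, and "unknown"
    when neither does (for example, digits or emoji only).
    """
    has_sinhala = bool(SINHALA_CHARS.search(text))
    has_latin = bool(LATIN_CHARS.search(text))
    if has_sinhala and has_latin:
        return "code-mixed"
    if has_sinhala:
        return "sinhala"
    if has_latin:
        return "english"
    return "unknown"


# Illustrative inputs, not taken from the paper's dataset.
for sample in ["The ATM is not working", "බැංකුව", "ATM එක වැඩ නැහැ"]:
    print(sample, "->", detect_script_mix(sample))
```

A script check of this kind cannot recognise Sinhala written in Latin letters (romanised Sinhala), which would be labelled "english" here; handling that case would need a language identification model rather than a character-range test.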
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid NLP method for multilingual content analysis
Fine-tuned XLM-RoBERTa with domain-specific vocabulary
BERT and XLM-RoBERTa outperform GPT-4 in classification
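
For context on what the fine-tuned models are compared against, a keyword-based relevance filter of the kind named among the baselines can be sketched in a few lines. This is a minimal illustration and not the paper's baseline implementation: the lexicon `BANKING_TERMS`, the function `is_relevant`, and the `min_hits` threshold are all hypothetical.

```python
import re

# Hypothetical banking lexicon (English plus two Sinhala terms);
# the paper's actual vocabulary is not given in the source.
BANKING_TERMS = {
    "loan", "account", "atm", "interest", "deposit", "card", "branch",
    "ණය",      # loan
    "ගිණුම",   # account
}

# A token is a run of Latin letters or a run of Sinhala-block characters.
TOKEN = re.compile(r"[A-Za-z]+|[\u0D80-\u0DFF]+")


def is_relevant(comment, lexicon=BANKING_TERMS, min_hits=1):
    """Return True when the comment contains at least min_hits lexicon terms."""
    tokens = TOKEN.findall(comment.lower())
    hits = sum(1 for token in tokens if token in lexicon)
    return hits >= min_hits


print(is_relevant("My ATM card was blocked"))  # two lexicon terms present
print(is_relevant("Happy birthday!"))          # no lexicon terms present
```

Exact matching of this sort misses inflected forms ("loans"), misspellings, and romanised Sinhala, which is consistent with the abstract's finding that fine-tuned transformer classifiers outperform keyword-based filtering.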
Authors

F. A. Rizvi
Department of Computer Science, Sri Lanka Institute of Information Technology, Colombo, Sri Lanka
T. Navojith
Department of Computer Science, Sri Lanka Institute of Information Technology, Colombo, Sri Lanka
A.M.N.H. Adhikari
Department of Computer Science, Sri Lanka Institute of Information Technology, Colombo, Sri Lanka
W.P.U. Senevirathna
Department of Computer Science, Sri Lanka Institute of Information Technology, Colombo, Sri Lanka
Dharshana Kasthurirathna
Assistant Professor, Sri Lanka Institute of Information Technology (SLIIT)
Evolutionary Game Theory, Network Science, Machine Learning, Evolutionary Computing, Distributed Systems
Lakmini Abeywardhana
Lecturer, Sri Lanka Institute of Information Technology
Machine learning, Image processing, Automated species identification