Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

📅 2025-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses three challenges in banking reputation monitoring for low-resource, code-mixed Sinhala-English text: keyword extraction, irrelevant comment filtering, and aspect-based classification. It proposes a hybrid, domain-adapted NLP framework with three components: (1) a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary for Sinhala and code-mixed keyword extraction; (2) a hybrid English keyword extractor that combines a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank; and (3) fine-tuned Transformer classifiers (BERT-base-uncased for English, XLM-RoBERTa for Sinhala) for comment filtering and aspect classification. Reported results are a keyword extraction accuracy of 91.2% (English) and 87.4% (Sinhala and code-mixed), irrelevant comment filtering scores of 85.2% (English) and 88.1% (Sinhala), and aspect classification scores of 87.4% (English) and 85.9% (Sinhala), in each case ahead of the baselines named in the abstract (GPT-4o, SVM, and keyword-based filtering for comment filtering; GPT-4 and keyword-based approaches for aspect classification).

📝 Abstract

Brand reputation in the banking sector is maintained through insightful analysis of customer opinion in code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, particularly when it involves low-resource language pairs such as Sinhala-English, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which yields a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which yields an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa achieving 88.1% for Sinhala, both better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa achieving 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
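
The abstract does not say how the outputs of the four English keyword extractors are combined. As one possible illustration only, the sketch below merges ranked keyword lists by voting plus summed reciprocal rank. The function name `fuse_keywords`, the `min_votes` threshold, and the sample candidate lists are all assumptions made for this example; they are not taken from the paper, and no extractor is actually run here.

```python
from collections import defaultdict


def fuse_keywords(ranked_lists, top_k=5, min_votes=2):
    """Merge ranked keyword lists from several extractors (illustrative only).

    ranked_lists maps an extractor name to its keywords, best first.
    A keyword survives only if at least `min_votes` extractors proposed it;
    survivors are ordered by summed reciprocal rank, ties broken alphabetically.
    """
    votes = defaultdict(int)
    score = defaultdict(float)
    for keywords in ranked_lists.values():
        seen = set()  # count each keyword once per extractor
        for rank, keyword in enumerate(keywords, start=1):
            key = keyword.strip().lower()
            if not key or key in seen:
                continue
            seen.add(key)
            votes[key] += 1
            score[key] += 1.0 / rank
    kept = [k for k in score if votes[k] >= min_votes]
    kept.sort(key=lambda k: (-score[k], k))
    return kept[:top_k]


# Hypothetical extractor outputs for a single banking comment.
candidates = {
    "ner": ["interest rate", "savings account"],
    "keybert": ["savings account", "interest rate", "mobile app"],
    "yake": ["mobile app", "interest rate", "branch"],
    "embedrank": ["interest rate", "loan"],
}
fused = fuse_keywords(candidates)
print(fused)
```

With these made-up inputs, "branch" and "loan" are dropped because only one extractor proposed each. Requiring agreement between extractors is a simple way to trade recall for precision; the scheme the authors actually used may differ.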

Problem

Research questions and friction points this paper is trying to address.

Keyword extraction from Sinhala-English code-mixed content

Aspect classification for multilingual banking texts

Filtering of irrelevant comments in low-resource language datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid NLP method for multilingual content analysis

Fine-tuned XLM-RoBERTa with domain-specific vocabulary

BERT and XLM-RoBERTa outperform GPT-4 in classification
