Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses core challenges in Tamil–English code-mixed sentiment analysis—namely, syntactic inconsistency, orthographic variation, and phonetic ambiguity—under low-resource conditions. To this end, we propose a robust classification framework comprising phoneme-level normalization as a preprocessing step, coupled with data augmentation and multi-model ensemble strategies. We conduct the first systematic evaluation of multilingual pretrained language models—including XLM-RoBERTa, mT5, IndicBERT, and RemBERT—on this task, assessing their generalization capabilities in code-mixed, low-resource settings. Experimental results demonstrate significant improvements in classification accuracy over baseline approaches, while also revealing performance ceilings and cross-model comparative insights. Our findings provide a reproducible empirical foundation for high-quality code-mixed corpus construction, annotation guideline development, and cross-lingual sentiment analysis, offering both methodological innovation and practical guidance for under-resourced language processing.
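The phoneme-level normalization described above could, for instance, collapse expressive character elongation and map common romanized spelling variants to a canonical form before classification. The sketch below is a minimal illustration; the variant lexicon and rules are assumptions, not the paper's actual scheme.

```python
import re

# A minimal sketch of phoneme-level normalization for romanized
# Tamil-English text, in the spirit of the preprocessing step the
# summary describes. The variant lexicon here is a hypothetical
# example, not the paper's actual normalization table.
VARIANTS = {
    "nallaa": "nalla",  # hypothetical spelling variant -> canonical form
    "padamm": "padam",
}

def normalize(text: str) -> str:
    text = text.lower()
    # Collapse expressive character runs: "superrrr" -> "superr"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Token-wise replacement of known orthographic variants
    return " ".join(VARIANTS.get(tok, tok) for tok in text.split())
```

In practice such a table would be induced from the corpus (e.g. by clustering transliterations that share a phonetic key) rather than hand-written.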

📝 Abstract
Sentiment analysis of Tamil-English code-mixed texts is explored using advanced transformer-based models. The work addresses challenges arising from grammatical inconsistencies, orthographic variations, and phonetic ambiguities, and examines the limitations of existing datasets and annotation gaps, emphasizing the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, are evaluated in low-resource, code-mixed environments, and their performance metrics are analyzed to highlight which models handle multilingual sentiment classification most effectively. The findings suggest that further advances in data augmentation, phonetic normalization, and hybrid modeling are required to improve accuracy, and future research directions for sentiment analysis in code-mixed texts are proposed.
Problem

Research questions and friction points this paper is trying to address.

Addressing sentiment analysis challenges in Tamil-English code-mixed texts
Evaluating transformer models for low-resource multilingual sentiment classification
Proposing solutions for data gaps and phonetic ambiguities in code-mixing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer models for Tamil-English code-mixed texts
Addressing grammatical and phonetic inconsistencies
Evaluating XLM-RoBERTa and IndicBERT performance
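The multi-model ensemble strategy mentioned in the summary could, for example, take the form of hard voting over per-model sentiment labels. This is a hypothetical sketch; the paper's actual ensembling method is not detailed here.

```python
from collections import Counter

# Hard-voting ensemble over sentiment labels from several classifiers
# (e.g. XLM-RoBERTa, mT5, IndicBERT, RemBERT). Illustrative only.
def majority_vote(per_model_labels: list[list[str]]) -> list[str]:
    """per_model_labels[m][i] is model m's label for example i."""
    voted = []
    for labels in zip(*per_model_labels):
        # Counter preserves insertion order, so for equal counts the
        # first-seen label wins the tie deterministically.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

Soft voting (averaging per-class probabilities) is a common alternative when the component models expose calibrated scores.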