🤖 AI Summary
This work addresses core challenges in Tamil–English code-mixed sentiment analysis—namely, syntactic inconsistency, orthographic variation, and phonetic ambiguity—under low-resource conditions. To this end, we propose a robust classification framework comprising phoneme-level normalization as a preprocessing step, coupled with data augmentation and multi-model ensemble strategies. We conduct the first systematic evaluation of multilingual pretrained language models—including XLM-RoBERTa, mT5, IndicBERT, and RemBERT—on this task, assessing their generalization capabilities in code-mixed, low-resource settings. Experimental results demonstrate significant improvements in classification accuracy over baseline approaches, while also revealing performance ceilings and cross-model comparative insights. Our findings provide a reproducible empirical foundation for high-quality code-mixed corpus construction, annotation guideline development, and cross-lingual sentiment analysis, offering both methodological innovation and practical guidance for under-resourced language processing.
📝 Abstract
We explore sentiment analysis of Tamil–English code-mixed text using advanced transformer-based models, addressing challenges posed by grammatical inconsistency, orthographic variation, and phonetic ambiguity. We examine the limitations of existing datasets and gaps in annotation practice, underscoring the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, are evaluated in low-resource, code-mixed settings, and their performance metrics are analyzed to identify which models handle multilingual sentiment classification most effectively. The findings indicate that further advances in data augmentation, phonetic normalization, and hybrid modeling are needed to improve accuracy, and we propose directions for future research on sentiment analysis of code-mixed text.
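To make the phoneme-level normalization idea concrete, the following is a minimal toy sketch of what such a preprocessing step might look like for romanized Tamil–English text. The variant map and the vowel-elongation rule here are illustrative assumptions for demonstration only, not the rules used in the paper.

```python
import re

# Toy lookup collapsing a few romanized Tamil spelling variants to one
# canonical form. These pairs are hypothetical examples, not the
# paper's actual normalization table.
VARIANT_MAP = {
    "semmaa": "semma",
    "chemma": "semma",
    "padham": "padam",
}

def normalize_token(token: str) -> str:
    """Lowercase, collapse elongated vowels, then apply the variant lookup."""
    t = token.lower()
    # Collapse repeated vowels, e.g. "superrr" stays, "supeeer" -> "super".
    t = re.sub(r"([aeiou])\1+", r"\1", t)
    return VARIANT_MAP.get(t, t)

def normalize_text(text: str) -> str:
    """Normalize each whitespace-separated token independently."""
    return " ".join(normalize_token(tok) for tok in text.split())

print(normalize_text("Semmaaa padham"))  # -> "semma padam"
```

A real system would learn or curate these mappings from data (e.g., via phonetic hashing or transliteration back to Tamil script) rather than hard-coding them, but the interface — a token-level canonicalizer applied before tokenization — would be similar.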