Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models

📅 2025-10-25
🤖 AI Summary
This work addresses the low-resource challenge of Urdu irony detection by proposing a cross-lingual transfer framework for data construction and model evaluation. We translate a high-quality English irony corpus into Urdu and perform rigorous human verification, thereby establishing the first publicly available, consistently annotated Urdu irony dataset. Methodologically, we systematically compare classical machine learning classifiers built on traditional word embeddings (GloVe, Word2Vec), transformer-based models (BERT, RoBERTa), and leading open-source large language models (LLaMA 2 (7B), LLaMA 3 (8B), Mistral), employing both embedding-based and fine-tuning strategies. We also combine Urdu transliteration techniques with modern NLP architectures. Experimental results show that fine-tuned LLaMA 3 (8B) achieves an F1-score of 94.61%, outperforming the best classical baseline, Gradient Boosting (89.18%), which supports the use of large language models for low-resource irony detection.

📝 Abstract
Irony identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detect irony in Urdu by translating an English irony corpus into Urdu. We evaluate ten state-of-the-art machine learning algorithms using GloVe and Word2Vec embeddings, and compare their performance with classical methods. Additionally, we fine-tune advanced transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B), and Mistral, to assess the effectiveness of large-scale models in irony detection. Among machine learning models, Gradient Boosting achieved the best performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3 (8B) achieved the highest performance with an F1-score of 94.61%. These results demonstrate that combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, a historically low-resource language.
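
The reported scores are F1 values, which combine precision and recall for the target class. As a minimal sketch of how such a score is computed for binary labels, the following assumes 1 marks an ironic sentence and 0 a non-ironic one; the abstract does not say whether the reported F1 is binary, macro, or weighted, and the label lists below are made up purely for illustration.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical labels (not from the paper): 1 = ironic, 0 = not ironic.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(round(f1_score(gold, pred), 4))  # 0.6667
```

Here two ironic sentences are found, one is missed, and one non-ironic sentence is flagged by mistake, so precision and recall are both 2/3 and so is F1.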

Problem

Research questions and friction points this paper is trying to address.

Detecting irony in Urdu text using comparative machine learning approaches
Evaluating traditional ML models versus large language models for Urdu irony
Addressing irony detection challenges in the low-resource Urdu language context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Translating an English irony corpus into Urdu
Evaluating ten machine learning algorithms with GloVe and Word2Vec embeddings
Fine-tuning transformer models for irony detection
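
One common way to feed word embeddings such as GloVe or Word2Vec to a classical classifier is to average the vectors of a sentence's tokens into a single fixed-length feature vector. The sketch below illustrates that idea only; the summary does not state how the authors pooled embeddings, and the function name, the 3-dimensional vectors, and the tiny vocabulary are hypothetical.

```python
from typing import Dict, List


def sentence_vector(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 3-dimensional embeddings with made-up values (real GloVe/Word2Vec vectors are much larger).
TOY_EMBEDDINGS = {
    "واہ": [1.0, 0.0, 2.0],
    "کیا": [0.0, 1.0, 0.0],
    "بات": [2.0, 2.0, 1.0],
}

# The last token is out of vocabulary here, so it is skipped when averaging.
tokens = ["واہ", "کیا", "بات", "ہے"]
print(sentence_vector(tokens, TOY_EMBEDDINGS, dim=3))  # [1.0, 1.0, 1.0]
```

Vectors produced this way could then be passed, one per sentence, to a classifier such as gradient boosting, while the fine-tuned transformer models consume the tokenized text directly.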

Authors

Fiaz Ahmad
The University of Central Punjab (UCP), Punjab, Pakistan
Nisar Hussain
Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
Amna Qasim
Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
Momina Hafeez
Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
Muhammad Usman
Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
Grigori Sidorov
Professor of Computational Linguistics, Instituto Politécnico Nacional (IPN), Mexico
Computational Linguistics, Natural Language Processing, Artificial Intelligence, Machine Learning
Alexander Gelbukh
Instituto Politécnico Nacional
Computational Linguistics, Natural Language Processing, Sentic Computing, Opinion Mining, Sentiment