A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts

📅 2025-09-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the absence of large language model (LLM) cross-lingual performance benchmarks for Persian social media sentiment analysis and emotion detection. We conduct the first systematic evaluation of Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, and GPT-4o on a balanced Persian dataset, employing unified prompting strategies, standardized inference parameters, and standard metrics (e.g., precision, F1-score) to ensure fair comparison. Our multidimensional evaluation framework jointly assesses accuracy, inference latency, and API cost, while identifying Persian-specific sociolinguistic challenges and recurrent error patterns. Results show all models achieve practical utility: GPT-4o attains the highest accuracy, whereas Gemini 2.0 Flash offers optimal cost-efficiency; emotion classification proves significantly more challenging than sentiment analysis. This work establishes the first empirical, reproducible LLM benchmark for low-resource, non-English NLP, providing both a rigorous evaluation methodology and actionable guidance for model selection in Persian-language applications.

๐Ÿ“ Abstract
This study presents a comprehensive comparative evaluation of four state-of-the-art Large Language Models (LLMs), Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, and GPT-4o, for sentiment analysis and emotion detection in Persian social media texts. Comparative analysis of LLMs has risen sharply in recent years; however, most such analyses have been conducted on English-language tasks, leaving gaps in our understanding of cross-linguistic performance patterns. This research addresses these gaps through a rigorous experimental design using balanced Persian datasets containing 900 texts for sentiment analysis (positive, negative, neutral) and 1,800 texts for emotion detection (anger, fear, happiness, hate, sadness, surprise). To enable a direct and fair comparison among the models, we used consistent prompts and uniform processing parameters, and we analyzed performance metrics such as precision, recall, and F1-score, along with misclassification patterns. The results show that all models reach an acceptable level of performance, and a statistical comparison of the top three models indicates no significant differences among them. However, GPT-4o achieved marginally higher raw accuracy on both tasks, while Gemini 2.0 Flash proved the most cost-efficient. The findings indicate that emotion detection is more challenging for all models than sentiment analysis, and that the misclassification patterns reveal challenges specific to Persian-language texts. These findings establish performance benchmarks for Persian NLP applications and offer practical guidance for model selection based on accuracy, efficiency, and cost, while revealing cultural and linguistic challenges that merit consideration when deploying multilingual AI systems.
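The "consistent prompts" setup described in the abstract can be sketched as a single prompt template shared across all four models. This is a minimal illustrative sketch: the template wording and the `build_prompt` helper are assumptions, since the paper's exact prompts are not reproduced here; only the two label sets come from the abstract.

```python
# Hypothetical unified prompt builder; the template text is an assumption,
# only the label sets are taken from the paper's task definitions.
SENTIMENT_LABELS = ["positive", "negative", "neutral"]
EMOTION_LABELS = ["anger", "fear", "happiness", "hate", "sadness", "surprise"]

def build_prompt(text: str, labels: list[str]) -> str:
    """Build one classification prompt that every model receives verbatim,
    so differences in output reflect the models rather than the prompts."""
    options = ", ".join(labels)
    return (
        "You are a classifier for Persian social media texts.\n"
        f"Assign the following text exactly one label from: {options}.\n"
        "Respond with the label only.\n\n"
        f"Text: {text}"
    )

# Example: the same template serves both tasks by swapping the label set.
sentiment_prompt = build_prompt("متن نمونه", SENTIMENT_LABELS)
emotion_prompt = build_prompt("متن نمونه", EMOTION_LABELS)
```

In such a setup, the prompt string would then be sent to each provider's API with identical inference parameters (e.g., the same temperature), which is what makes the accuracy, latency, and cost figures comparable across models.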
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for Persian sentiment analysis in social media
Assessing emotion detection performance on Persian language texts
Comparing cross-linguistic performance patterns of state-of-the-art LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative evaluation of four LLMs
Using balanced Persian datasets for testing
Analyzing performance metrics and misclassification patterns
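The per-class metrics named above (precision, recall, F1) follow directly from paired gold/predicted label lists. The sketch below is a dependency-free implementation of the standard definitions, not the paper's evaluation code; the toy labels in the example are illustrative.

```python
def per_class_prf(y_true: list[str], y_pred: list[str], labels: list[str]) -> dict:
    """Return {label: (precision, recall, f1)} computed from paired label lists."""
    stats = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats[lab] = (prec, rec, f1)
    return stats

def macro_f1(y_true: list[str], y_pred: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-class F1; appropriate for balanced datasets
    like the 900-text sentiment and 1,800-text emotion sets."""
    stats = per_class_prf(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in stats.values()) / len(labels)
```

Misclassification patterns come from the same paired lists, typically tabulated as a confusion matrix whose off-diagonal cells show which label pairs each model confuses.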