A meta-analysis on the performance of machine-learning based language models for sentiment analysis

📅 2025-09-10
🤖 AI Summary
Prior evaluations of machine learning models for Twitter sentiment analysis suffer from methodological heterogeneity and inconsistent reporting, hindering cross-study comparability and reproducibility. Method: We conducted a systematic meta-analysis of 195 experimental trials drawn from 20 studies, following PRISMA guidelines and employing the double-arcsine transformation and a three-level random-effects model to synthesize performance estimates. Contribution/Results: The pooled mean accuracy across models is 0.80 [0.76, 0.84], with significant performance degradation attributable to class imbalance and increasing granularity of sentiment categories. Critically, we identify a widespread lack of standardization in current reporting practices. To address this, we propose two methodological improvements: (1) normalization of accuracy metrics to mitigate dataset-induced bias, and (2) mandatory reporting of confusion matrices on independent test sets. Our framework enhances result transparency, enables rigorous cross-study comparison, and establishes a methodological benchmark for evaluation practices in sentiment analysis research.
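The double-arcsine transformation mentioned above refers to the Freeman-Tukey variant commonly used to stabilize the variance of proportions (such as accuracy) before pooling. A minimal sketch of that transform, assuming `x` correct classifications out of `n` test instances (function name and interface are illustrative, not taken from the paper):

```python
import math

def freeman_tukey(x, n):
    """Freeman-Tukey double-arcsine transform of x successes out of n.

    Returns the transformed value and its approximate sampling
    variance, 1 / (n + 0.5), which is (nearly) independent of the
    underlying proportion -- the property that makes pooling across
    studies with different accuracies statistically convenient.
    """
    t = math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))
    var = 1.0 / (n + 0.5)
    return t, var

# Example: a model scoring 80/100 on a test set.
t, var = freeman_tukey(80, 100)
```

Pooled estimates computed on this scale are back-transformed to proportions for reporting, which is how a pooled accuracy such as 0.80 [0.76, 0.84] would be obtained.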

📝 Abstract
This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Following PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies, coding 12 study features. Overall accuracy, the most frequently reported performance metric, was analyzed using the double-arcsine transformation and a three-level random-effects model. The average overall accuracy under the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) overall accuracy is widely used but often misleading because of its sensitivity to class imbalance and to the number of sentiment classes, highlighting the need for normalization; 2) standardized reporting of model performance, including confusion matrices for independent test sets, is essential for reliable cross-study comparison of ML classifiers, yet remains far from common practice.
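The abstract's first insight, that overall accuracy is sensitive to class imbalance, can be illustrated with a small confusion matrix. The counts below are purely hypothetical (not from the paper): a majority "neutral" class inflates overall accuracy, while balanced accuracy, one simple normalization, reveals weak performance on the minority classes:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, cols = predicted),
# dominated by a "neutral" majority class; counts are illustrative only.
cm = np.array([
    [900, 30, 20],   # neutral
    [ 60, 25, 15],   # positive
    [ 50, 10, 40],   # negative
])

overall_acc = np.trace(cm) / cm.sum()            # inflated by the majority class
per_class_recall = np.diag(cm) / cm.sum(axis=1)  # recall per sentiment class
balanced_acc = per_class_recall.mean()           # mean recall: one normalization
```

Here `overall_acc` is about 0.84 even though the positive and negative classes are recalled at only 0.25 and 0.40, which is exactly the distortion that reporting full confusion matrices on independent test sets would expose.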
Problem

Research questions and friction points this paper is trying to address.

Evaluating machine learning performance in Twitter sentiment analysis
Assessing study heterogeneity and characteristics influencing model accuracy
Addressing misleading overall accuracy metrics and reporting standardization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-analysis of ML sentiment models
Three-level random effects modeling
Standardized performance reporting guidelines
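The three-level model listed above additionally nests trials within studies; for intuition about random-effects pooling itself, a simpler two-level DerSimonian-Laird sketch can be written in a few lines. This is an assumed simplification for illustration, not the paper's actual model:

```python
import numpy as np

def dersimonian_laird(y, v):
    """Two-level random-effects pooling (DerSimonian-Laird).

    y: effect sizes on the transformed scale; v: within-study variances.
    Estimates the between-study variance tau^2 from Cochran's Q, then
    re-weights by 1 / (v + tau^2) to obtain the pooled estimate.
    """
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)          # fixed-effect pooled mean
    q = np.sum(w * (y - y_fixed) ** 2)           # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)      # method-of-moments tau^2
    w_star = 1.0 / (v + tau2)                    # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    return pooled, tau2
```

A three-level extension would split `tau2` into between-study and within-study (between-trial) components, which is what allows the paper to separate the two sources of heterogeneity.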
Elena Rohde
Institute of Sociology, Faculty of Social Sciences University of Duisburg-Essen, Lotharstr. 65, 47057 Duisburg, Germany
Jonas Klingwort
Department of Research & Development, Statistics Netherlands (CBS), CBS-weg 11, PO Box 4481, 6401 CZ Heerlen, the Netherlands
Christian Borgs
Professor, UC Berkeley
Statistical Physics · Computer Science · Statistics