A meta-analysis on the performance of machine-learning based language models for sentiment analysis

📅 2025-09-10
🤖 AI Summary
Prior evaluations of machine learning models for Twitter sentiment analysis suffer from methodological heterogeneity and inconsistent reporting, hindering cross-study comparability and reproducibility. Method: We conducted a systematic meta-analysis of 195 experimental trials drawn from 20 studies, following PRISMA guidelines and employing the double-arcsine transformation and a three-level random-effects model to synthesize performance estimates. Contribution/Results: The pooled mean accuracy across models is 0.80 [0.76, 0.84], with significant performance degradation attributable to class imbalance and increasing granularity of sentiment categories. Critically, we identify a widespread lack of standardization in current reporting practices. To address this, we propose two methodological improvements: (1) normalization of accuracy metrics to mitigate dataset-induced bias, and (2) mandatory reporting of confusion matrices on independent test sets. Our framework enhances result transparency, enables rigorous cross-study comparison, and establishes a methodological benchmark for evaluation practices in sentiment analysis research.
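The double-arcsine transformation mentioned above refers to the Freeman-Tukey variant commonly used to stabilize the variance of proportions (such as accuracy) before pooling. A minimal sketch of that transform, assuming `x` correct classifications out of `n` test instances (function name and interface are illustrative, not taken from the paper):

```python
import math

def freeman_tukey(x, n):
    """Freeman-Tukey double-arcsine transform of x successes out of n.

    Returns the transformed value and its approximate sampling
    variance, 1 / (n + 0.5), which is (nearly) independent of the
    underlying proportion -- the property that makes pooling across
    studies with different accuracies statistically convenient.
    """
    t = math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))
    var = 1.0 / (n + 0.5)
    return t, var

# Example: a model scoring 80/100 on a test set.
t, var = freeman_tukey(80, 100)
```

Pooled estimates computed on this scale are back-transformed to proportions for reporting, which is how a pooled accuracy such as 0.80 [0.76, 0.84] would be obtained.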

📝 Abstract
This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Following PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies, coding 12 study features. Overall accuracy, the most frequently reported performance metric, was analyzed using the double-arcsine transformation and a three-level random-effects model. The average overall accuracy under the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) overall accuracy is widely used but often misleading because of its sensitivity to class imbalance and to the number of sentiment classes, highlighting the need for normalization; 2) standardized reporting of model performance, including confusion matrices for independent test sets, is essential for reliable cross-study comparison of ML classifiers, yet remains far from common practice.
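The abstract's first insight, that overall accuracy is sensitive to class imbalance, can be illustrated with a small confusion matrix. The counts below are purely hypothetical (not from the paper): a majority "neutral" class inflates overall accuracy, while balanced accuracy, one simple normalization, reveals weak performance on the minority classes:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, cols = predicted),
# dominated by a "neutral" majority class; counts are illustrative only.
cm = np.array([
    [900, 30, 20],   # neutral
    [ 60, 25, 15],   # positive
    [ 50, 10, 40],   # negative
])

overall_acc = np.trace(cm) / cm.sum()            # inflated by the majority class
per_class_recall = np.diag(cm) / cm.sum(axis=1)  # recall per sentiment class
balanced_acc = per_class_recall.mean()           # mean recall: one normalization
```

Here `overall_acc` is about 0.84 even though the positive and negative classes are recalled at only 0.25 and 0.40, which is exactly the distortion that reporting full confusion matrices on independent test sets would expose.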
Problem

Research questions and friction points this paper is trying to address.

Evaluating machine learning performance in Twitter sentiment analysis
Assessing study heterogeneity and characteristics influencing model accuracy
Addressing misleading overall accuracy metrics and reporting standardization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-analysis of ML sentiment models
Three-level random effects modeling
Standardized performance reporting guidelines
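The three-level model listed above additionally nests trials within studies; for intuition about random-effects pooling itself, a simpler two-level DerSimonian-Laird sketch can be written in a few lines. This is an assumed simplification for illustration, not the paper's actual model:

```python
import numpy as np

def dersimonian_laird(y, v):
    """Two-level random-effects pooling (DerSimonian-Laird).

    y: effect sizes on the transformed scale; v: within-study variances.
    Estimates the between-study variance tau^2 from Cochran's Q, then
    re-weights by 1 / (v + tau^2) to obtain the pooled estimate.
    """
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)          # fixed-effect pooled mean
    q = np.sum(w * (y - y_fixed) ** 2)           # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)      # method-of-moments tau^2
    w_star = 1.0 / (v + tau2)                    # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    return pooled, tau2
```

A three-level extension would split `tau2` into between-study and within-study (between-trial) components, which is what allows the paper to separate the two sources of heterogeneity.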
Elena Rohde
Institute of Sociology, Faculty of Social Sciences University of Duisburg-Essen, Lotharstr. 65, 47057 Duisburg, Germany
Jonas Klingwort
Department of Research & Development, Statistics Netherlands (CBS), CBS-weg 11, PO Box 4481, 6401 CZ Heerlen, the Netherlands
Christian Borgs
Professor, UC Berkeley
Statistical Physics · Computer Science · Statistics