SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing public COVID-19 datasets suffer from coarse-grained sentiment labels, scarce annotations, and limited multilingual coverage, which hinders fine-grained cross-lingual sentiment analysis. Method: We construct the first fine-grained (10-class), multilingual (English, Arabic, Spanish, French, Italian) COVID-19 tweet dataset. Combining human annotation of English and Arabic tweets with machine translation of the English tweets into Spanish, French, and Italian, together with large-scale crawling, rigorous cleaning, and temporal alignment, we build a time-stamped corpus comprising 50K labeled tweets (10K each in English and Arabic, plus 30K translated) and over 105M unlabeled tweets. We fine-tune pretrained Transformers on the labeled tweets for fine-grained classification and assess the dataset's compatibility with large language models (e.g., ChatGPT). Contribution/Results: The dataset supports analysis of how sentiment evolves across languages, countries, and topics over time. The dataset and code are publicly released to facilitate research on fine-grained, cross-lingual sentiment evolution during major public health emergencies.

📝 Abstract

The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible at https://github.com/gitdevqiang/SenWave. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.

Problem

Research questions and friction points this paper is trying to address.

Addresses the lack of fine-grained multilingual sentiment data for COVID-19 tweets

Provides an annotated dataset with ten sentiment categories across five languages

Enables analysis of evolving emotional landscapes across countries and topics
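
The kind of analysis named in the last point can be illustrated with a minimal sketch that computes, for each month and language, the share of tweets carrying each sentiment label. This is not the authors' code: the record layout, the function name `monthly_label_shares`, and the label names are hypothetical placeholders, and the sketch assumes each tweet may carry one or more labels.

```python
from collections import Counter, defaultdict

# Hypothetical records: (ISO date, language code, set of sentiment labels).
# The label names are placeholders, not the dataset's actual categories.
TWEETS = [
    ("2020-03-02", "en", {"label_a"}),
    ("2020-03-15", "en", {"label_a", "label_b"}),
    ("2020-03-20", "ar", {"label_b"}),
    ("2020-04-01", "en", {"label_c"}),
]


def monthly_label_shares(tweets):
    """Return {(month, language): {label: share of tweets carrying it}}."""
    label_counts = defaultdict(Counter)  # per (month, language) label tallies
    tweet_counts = Counter()             # per (month, language) tweet totals
    for date, lang, labels in tweets:
        key = (date[:7], lang)           # "YYYY-MM" prefix of the ISO date
        tweet_counts[key] += 1
        label_counts[key].update(labels)
    return {
        key: {label: n / tweet_counts[key] for label, n in counts.items()}
        for key, counts in label_counts.items()
    }


shares = monthly_label_shares(TWEETS)
```

Because a tweet may carry several labels under this assumption, the shares within one month and language need not sum to 1.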

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned transformer models for sentiment classification

Multi-language dataset with ten sentiment categories

Assessed dataset compatibility with ChatGPT for robustness
