Do Large Language Models Possess Sensitive to Sentiment?

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates large language models’ (LLMs) sensitivity to textual sentiment—positive, negative, or neutral—with emphasis on foundational sentiment classification, fine-grained understanding of irony and sarcasm, and associated consistency and biases. Method: We introduce the first cross-model, multi-benchmark, human-integrated evaluation framework for sentiment sensitivity, unifying standard datasets (SST-2, IMDB, TweetEval) with human annotation comparisons, inter-model response consistency quantification, and targeted irony probing experiments. Contribution/Results: Results show that mainstream LLMs exhibit baseline sentiment sensitivity but underperform humans by 12.6% in average classification accuracy; irony/sarcasm detection error rates reach 38.4%, and inter-model performance variance (standard deviation) is 15.2%. We identify model architecture and pretraining data composition as primary determinants of sentiment sensitivity disparities. The framework enables rigorous, reproducible assessment of affective reasoning capabilities across LLMs.
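
This listing ships no code, so the following is a minimal sketch of the kind of cross-model accuracy and consistency measurement the summary describes. The `classify` stub, the model names, and the toy data are all assumptions for illustration, not the authors' implementation; a real harness would prompt each LLM over SST-2, IMDB, and TweetEval, parse its answers, and compare against human-annotated labels.

```python
from statistics import mean, stdev

# Hypothetical stand-in for a real model API call; this stub ignores the
# `model` argument and uses a trivial lexicon heuristic. Swap in actual
# LLM prompting to reproduce the kind of numbers the summary reports.
def classify(model: str, text: str) -> str:
    positive = {"great", "love", "wonderful"}
    negative = {"terrible", "hate", "awful"}
    words = set(text.lower().split())
    if words & positive:
        return "positive"
    if words & negative:
        return "negative"
    return "neutral"

def evaluate(models, dataset, human_accuracy):
    """dataset: list of (text, gold_label) pairs with labels in
    {"positive", "negative", "neutral"}."""
    per_model = {}
    for m in models:
        correct = sum(classify(m, text) == gold for text, gold in dataset)
        per_model[m] = correct / len(dataset)
    accs = list(per_model.values())
    return {
        "per_model_accuracy": per_model,
        "mean_accuracy": mean(accs),
        "gap_vs_human": human_accuracy - mean(accs),  # cf. the 12.6% gap
        "inter_model_stdev": stdev(accs),             # cf. the 15.2% spread
    }

# Toy usage with made-up sentences; real runs would load the benchmarks.
if __name__ == "__main__":
    data = [("I love this movie", "positive"),
            ("This was awful", "negative"),
            ("The film runs two hours", "neutral")]
    print(evaluate(["model-a", "model-b"], data, human_accuracy=0.95))
```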

📝 Abstract
Large Language Models (LLMs) have recently demonstrated extraordinary capabilities in language understanding. However, comprehensively assessing the sentiment capabilities of LLMs remains a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As LLMs are integrated into ever more applications, it becomes critical to understand their sensitivity to emotional tone, since it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to positive, negative, and neutral sentiment. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our findings indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, underscoring the need for further improvements to their training processes to better capture subtle emotional cues. For example, in some cases the models misclassify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need refinement. Another aspect is that different LLMs may perform differently on the same data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
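
Where the abstract mentions sarcasm and irony failures, a small probe makes the setup concrete. This is a hedged sketch only: `query_llm` is a hypothetical placeholder for a model API call, and the probe sentences are illustrative examples rather than the paper's actual test items.

```python
# Minimal sketch of a targeted irony/sarcasm probe in the spirit of the
# experiments described above. Nothing here comes from the paper itself.

PROMPT = ("Classify the sentiment of the following text as positive, "
          "negative, or neutral. Answer with a single word.\n\nText: {text}")

PROBES = [
    # Sarcastic praise: surface-positive wording, negative intent.
    ("Oh, fantastic. My flight got cancelled again.", "negative"),
    # Strongly positive text that weaker models may flatten to neutral.
    ("This is hands down the best concert I have ever been to!", "positive"),
    ("The package arrived on Tuesday.", "neutral"),
]

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real model API call.
    raise NotImplementedError("wire up an actual LLM client here")

def irony_error_rate():
    errors = []
    for text, expected in PROBES:
        answer = query_llm(PROMPT.format(text=text)).strip().lower()
        if answer != expected:
            errors.append((text, expected, answer))
    return len(errors) / len(PROBES), errors
```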
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' sentiment detection
Evaluate LLMs' emotional response accuracy
Identify LLMs' sentiment classification errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs' sentiment detection
Compares models with human evaluations
Highlights need for training enhancements
👥 Authors
Yang Liu, Xichou Zhu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhi Li, Zhiyang Xu, Wei-Xiang Luo, Junhui Wang
Machine Learning & AI Team, Privacy and Data Protection Office, ByteDance, Beijing, China