🤖 AI Summary
This study addresses the lack of effective evaluation frameworks for assessing the semantic understanding and quantitative reasoning capabilities of large language models (LLMs) when processing massive unstructured text, such as social media data. The authors introduce the first fine-grained, question-based benchmark, comprising 470 human-crafted questions that span tasks such as sentiment analysis and hate speech detection, and systematically evaluate both proprietary and open-source LLMs on diverse Twitter datasets. Their findings reveal significant performance degradation when the input exceeds 500 posts or when models face complex quantitative tasks involving comparison, counting, or calculation. These results expose critical bottlenecks in LLMs’ ability to perform large-scale quantitative analysis of text and point to clear directions for future model improvement.
📝 Abstract
LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly open-weight models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.