🤖 AI Summary
This study addresses the challenge of detecting multi-label harmful content, such as conspiracy theories and incendiary rhetoric, on social media during the 2024 U.S. presidential election. The authors introduce USE24-XD, a novel dataset of nearly 100,000 posts from X (formerly Twitter) and present a large-scale collaborative annotation effort in which six large language models (LLMs) automatically label five categories of harmful content. By combining a wisdom-of-the-crowd aggregation strategy across the LLMs with crowdsourced validation, they produce a high-quality multi-label benchmark in which 60% of posts receive at least one label. Experimental results show that the LLMs achieve up to 0.90 recall on the Speculation category and exhibit higher internal consistency than human annotators. The work further reveals systematic biases linked to annotators' political leanings, and the dataset is publicly released to support research on electoral information ecosystems.
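To make the aggregation step concrete, here is a minimal sketch of a wisdom-of-the-crowd majority vote across the six LLM annotators. The function name, vote threshold, and data layout are illustrative assumptions; the paper's exact aggregation rule may differ.

```python
from collections import Counter

CATEGORIES = ["Conspiracy", "Sensationalism", "Hate Speech", "Speculation", "Satire"]

def aggregate_labels(llm_annotations, min_votes=4):
    """Majority-vote aggregation across LLM annotators (hypothetical rule).

    llm_annotations: one label set per LLM, e.g. [{"Speculation"}, set(), ...].
    A category is kept only if at least `min_votes` of the six LLMs assigned it
    (the threshold here is an assumption, not the paper's stated rule).
    """
    votes = Counter(label for labels in llm_annotations for label in labels)
    return {cat for cat in CATEGORIES if votes[cat] >= min_votes}

# Toy example: four of six LLMs flag the post as Speculation, two add Satire.
annotations = [
    {"Speculation"}, {"Speculation", "Satire"}, {"Speculation"},
    {"Speculation", "Satire"}, set(), {"Conspiracy"},
]
print(aggregate_labels(annotations))  # -> {'Speculation'}
```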
📝 Abstract
Election misinformation and harmful political content spread misleading narratives and pose a serious threat to democratic integrity. Detecting harmful content at an early stage is essential for understanding and potentially mitigating its downstream spread. In this study, we introduce USE24-XD, a large-scale dataset of nearly 100k posts collected from X (formerly Twitter) during the 2024 U.S. presidential election cycle, enriched with spatio-temporal metadata. To substantially reduce the cost of manual annotation while enabling scalable categorization, we employ six large language models (LLMs) to systematically annotate posts across five nuanced categories: Conspiracy, Sensationalism, Hate Speech, Speculation, and Satire. We validate the LLM annotations with crowdsourcing (n = 34) and benchmark them against human annotators. Inter-rater reliability analyses show comparable agreement patterns between LLMs and humans, with LLMs exhibiting higher internal consistency and achieving up to 0.90 recall on Speculation. We apply a wisdom-of-the-crowd approach across LLMs to aggregate annotations and curate a robust multi-label dataset, in which 60% of posts receive at least one label. We further analyze how human annotator demographics, including political ideology and affiliation, shape labeling behavior, highlighting systematic sources of subjectivity in judgments of harmful content. The USE24-XD dataset is publicly released to support future research.
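As a concrete illustration of the reliability and recall analyses mentioned above, the sketch below computes chance-corrected agreement (Cohen's kappa) and per-category recall for a single category, encoded as binary indicators, using scikit-learn. The toy data, variable names, and choice of kappa are assumptions; the paper may use a different reliability coefficient and evaluation protocol.

```python
from sklearn.metrics import cohen_kappa_score, recall_score

# Binary indicators for one category (e.g., Speculation) over the same posts:
# 1 = labeled as Speculation, 0 = not. Toy data for illustration only.
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # human-validated reference labels
llm   = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]   # one LLM's annotations

# Chance-corrected agreement between the two annotators.
kappa = cohen_kappa_score(human, llm)

# Recall: fraction of human-labeled positives the LLM recovers.
recall = recall_score(human, llm)

print(f"Cohen's kappa = {kappa:.2f}, recall = {recall:.2f}")
```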