Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically compares two multimodal social media data collection paradigms—user-donated authentic posts versus experimentally curated, annotated posts—and their impact on sentiment modeling. Using multimodal content analysis, cross-modal consistency evaluation, and standardized model benchmarking, we quantitatively demonstrate that curated posts exhibit longer textual content, weaker visual modality contribution, and more prototypical event representations; critically, they significantly deviate from authentic data in linguistic, visual, and demographic distributions. While such data may improve model generalization under controlled settings, they induce evaluation bias and compromise ecological validity. Only user-donated data enable reliable, real-world assessment of sentiment recognition performance. Our core contribution is the empirical validation of authentic data as indispensable for robust multimodal sentiment computation, establishing a rigorous evidence-based benchmark for future multimodal dataset curation.

📝 Abstract
Accurate modeling of subjective phenomena such as emotion expression requires data annotated with authors' intentions. Commonly, such data are collected by asking study participants either to donate and label genuine content produced in the real world, or to create content fitting particular labels during the study. Asking participants to create content is often simpler to implement and presents fewer risks to participant privacy than data donation. However, it is unclear if and how study-created content may differ from genuine content, and how such differences may impact models. We collect study-created and genuine multimodal social media posts labeled for emotion and compare them on several dimensions, including model performance. We find that compared to genuine posts, study-created posts are longer, rely more on their text and less on their images for emotion expression, and focus more on emotion-prototypical events. The samples of participants willing to donate versus create posts are demographically different. Study-created data is valuable for training models that generalize well to genuine data, but realistic effectiveness estimates require genuine data.
Problem

Research questions and friction points this paper is trying to address.

Comparing data donation vs creation for emotion-labeled social media posts
Assessing differences in model performance between genuine and study-created content
Evaluating demographic biases in participant samples for each collection method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compare study-created vs genuine social media posts
Analyze multimodal emotion expression differences
Assess model performance on both data types
Christopher Bagdon
Fundamentals of Natural Language Processing, University of Bamberg, Germany
Aidan Combs
Department of Sociology, The Ohio State University, USA
Carina Silberer
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany
Roman Klinger
Professor for Fundamentals of Natural Language Processing, University of Bamberg
natural language processing · emotion analysis · bioNLP · argument mining · computational psychology