Towards High-Fidelity Synthetic Multi-platform Social Media Datasets via Large Language Models

📅 2025-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of multi-platform social media data, high acquisition costs, and platform-imposed restrictions on real-world data, this paper proposes a topic-driven, cross-platform synthetic data generation framework. Methodologically, it designs a platform-adaptive prompt engineering strategy that leverages Llama-3, Claude-3, and GPT-4 to batch-generate textual content tailored to specific platforms (e.g., Twitter, Reddit, Instagram). It further introduces the first fidelity evaluation framework for cross-platform social data, integrating lexical and semantic similarity analysis with post-hoc calibration. Experiments show that the synthetic data closely approximates real data in both lexical distribution and semantic structure (average similarity: 0.82), while also revealing significant disparities among LLMs in modeling platform-specific stylistic conventions. The evaluation framework is open-sourced, and the resulting high-quality synthetic data can support downstream tasks such as misinformation detection and influence-operation analysis.

📝 Abstract
Social media datasets are essential for research on a variety of topics, such as disinformation, influence operations, hate speech detection, or influencer marketing practices. However, access to social media datasets is often constrained due to costs and platform restrictions. Acquiring datasets that span multiple platforms, which is crucial for understanding the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real data. We propose multi-platform topic-based prompting and employ various language models to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings show that using large language models to generate synthetic multi-platform social media data is promising, different language models perform differently in terms of fidelity, and a post-processing approach might be needed for generating high-fidelity synthetic datasets for research. In addition to the empirical evaluation of three state-of-the-art large language models, our contributions include new fidelity metrics specific to multi-platform social media datasets.
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity synthetic social media data across platforms
Overcoming access constraints to multi-platform social media datasets
Evaluating language models for lexical and semantic data fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large language models for synthetic data generation
Implements multi-platform topic-based prompting
Introduces fidelity metrics for multi-platform datasets
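The multi-platform topic-based prompting idea can be sketched as one prompt template per platform, instantiated per topic. This is a hypothetical illustration (the style descriptions and function names are assumptions, not the paper's actual prompts):

```python
# Hypothetical platform style hints used to adapt a shared topic prompt.
PLATFORM_STYLES = {
    "Twitter": "a short, informal post under 280 characters, with hashtags",
    "Reddit": "a discussion-style post with a title and a longer body",
    "Instagram": "a caption-style post with emojis and hashtags",
}

def build_prompt(platform: str, topic: str) -> str:
    """Compose a platform-adapted generation prompt for a given topic."""
    style = PLATFORM_STYLES[platform]
    return (
        f"Write {style} about the topic: {topic}. "
        f"Match the typical tone and conventions of {platform}."
    )

# Each prompt would be sent to an LLM (e.g., Llama-3, Claude-3, GPT-4)
# to batch-generate platform-specific synthetic posts.
for platform in PLATFORM_STYLES:
    print(build_prompt(platform, "local election turnout"))
```

Keeping the topic fixed while varying only the platform-specific instruction is what lets the resulting corpora be compared across platforms under the fidelity metrics.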