Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the growing practice of deploying large language models (LLMs) as human proxies in computational social science, showing that their unconstrained generation diverges significantly from authentic human language and thereby jeopardizes research validity. To tackle this issue, the work introduces the first history-conditioned reply prediction benchmark tailored to social media conversations, constructing an evaluation dataset from real-world interactions on the X platform. Using dual-dimensional metrics that capture both stylistic and content-related characteristics, the study systematically quantifies the gap between LLM-generated text and genuine human discourse. The findings reveal substantial limitations in current LLMs' ability to replicate the nuanced linguistic patterns of human communication, offering both a methodological framework and empirical evidence to improve the realism and validity of synthetic data in social science research.
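The dual-dimensional evaluation described above can be illustrated with a toy sketch. The paper's actual metrics are not specified in this summary, so the two measures below (token-count gap as a stylistic proxy, token-set Jaccard overlap as a content proxy) are hypothetical stand-ins chosen only to show the shape of such a comparison between an LLM reply and a human reference:

```python
# Illustrative sketch only: the paper's real stylistic and content metrics
# are not given here. These are crude stand-in measures.

def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()

def length_gap(generated: str, human: str) -> int:
    """Absolute difference in token counts (a crude stylistic signal)."""
    return abs(len(tokenize(generated)) - len(tokenize(human)))

def jaccard_overlap(generated: str, human: str) -> float:
    """Token-set Jaccard similarity (a crude content signal)."""
    a, b = set(tokenize(generated)), set(tokenize(human))
    return len(a & b) / len(a | b) if a | b else 1.0

# A terse, informal human reply vs. a stiff "naive" LLM reply:
human_reply = "lol same, my feed is all cat videos now"
llm_reply = "That is a fascinating observation about content algorithms."

print(length_gap(llm_reply, human_reply))                  # stylistic gap
print(round(jaccard_overlap(llm_reply, human_reply), 2))   # content overlap
```

Even this toy pair shows the pattern the paper targets: the replies are similar in length but share almost no vocabulary, and real evaluations would use far richer stylistic features (punctuation, slang, capitalization) and semantic similarity models.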

📝 Abstract
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, creating a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
linguistic discrepancies
synthetic data
computational social science
human communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

history-conditioned reply prediction
linguistic discrepancy
synthetic data evaluation
large language models
computational social science
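A minimal sketch of what "history-conditioned reply prediction" could look like in practice. The paper's exact prompt format is not given in this listing, so the `build_reply_prompt` helper and its thread layout below are hypothetical, showing only the general idea: feed the prior turns of an X thread to the model and ask for the next reply alone:

```python
# Hypothetical sketch: the paper's actual prompt template is not specified.
# History-conditioning means the model sees the preceding turns of a real
# thread and must produce only the next reply.

def build_reply_prompt(history: list[tuple[str, str]]) -> str:
    """Render (author, text) turns into a next-reply prediction prompt."""
    lines = [f"@{author}: {text}" for author, text in history]
    lines.append("Write the next reply in this thread, matching its tone and length:")
    return "\n".join(lines)

thread = [
    ("alice", "just saw the eclipse, unreal"),
    ("bob", "pics or it didn't happen"),
]
print(build_reply_prompt(thread))
```

Conditioning on history like this constrains the model toward the register of the actual conversation, which is what distinguishes this benchmark from unconstrained ("naive") generation.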