AI Summary
This work investigates the capability of large language models (LLMs) to simulate social media user engagement behaviors: specifically, predicting users' typical actions (e.g., retweeting, quoting, rewriting) toward trending posts and generating personalized responses, while evaluating alignment with real human behavior. To this end, we propose an "action-guided response generation" framework that, for the first time, decouples action prediction from conditional response generation. Experiments reveal that LLMs significantly underperform BERT in zero-shot action prediction, highlighting a fundamental limitation in action reasoning, yet achieve substantially higher semantic similarity than baselines in few-shot response generation. Our findings expose a duality in LLMs' behavioral simulation capacity: strong semantic generation but weak action modeling. This challenges the validity of end-to-end agent paradigms in social behavior modeling and introduces a novel methodological pathway grounded in modular, action-aware response generation.
Abstract
Social media enables dynamic user engagement with trending topics, and recent research has explored the potential of large language models (LLMs) for response generation. While some studies investigate LLMs as agents for simulating user behavior on social media, their focus remains on practical viability and scalability rather than a deeper understanding of how well LLMs align with human behavior. This paper analyzes LLMs' ability to simulate social media engagement through action-guided response generation, where a model first predicts a user's most likely engagement action (retweet, quote, or rewrite) toward a trending post, then generates a personalized response conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 on social media engagement simulation for a major societal event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT in action prediction, while few-shot prompting with limited examples initially degrades LLMs' prediction accuracy. However, in response generation, few-shot LLMs achieve stronger semantic alignment with ground-truth posts.
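The two-stage pipeline described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the prompt wording is assumed, and `call_llm` is a stub standing in for a real model API (e.g. GPT-4o-mini) so the sketch runs offline.

```python
# Sketch of action-guided response generation: stage 1 predicts the
# user's engagement action; stage 2 generates a response conditioned
# on that action. Prompts and the call_llm stub are assumptions.

ACTIONS = {"retweet", "quote", "rewrite"}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call. Returns canned replies
    so the pipeline can be exercised without network access."""
    if "Which action" in prompt:
        return "quote"
    return "Strongly agree, this affects all of us."

def predict_action(user_history: list[str], post: str) -> str:
    """Stage 1: predict the user's most likely engagement action."""
    prompt = (
        "User's recent posts:\n" + "\n".join(user_history) +
        f"\n\nTrending post: {post}\n"
        "Which action would this user most likely take: "
        "retweet, quote, or rewrite? Answer with one word."
    )
    action = call_llm(prompt).strip().lower()
    return action if action in ACTIONS else "retweet"  # fallback on invalid output

def generate_response(user_history: list[str], post: str, action: str) -> str:
    """Stage 2: generate a personalized response for the given action."""
    if action == "retweet":
        return post  # a retweet reproduces the post verbatim
    prompt = (
        "User's recent posts:\n" + "\n".join(user_history) +
        f"\n\nTrending post: {post}\n"
        f"Write the text this user would add when they {action} the post."
    )
    return call_llm(prompt)

def simulate_engagement(user_history: list[str], post: str) -> tuple[str, str]:
    """Decoupled pipeline: action prediction, then conditional generation."""
    action = predict_action(user_history, post)
    return action, generate_response(user_history, post, action)
```

Decoupling the stages this way is what lets the paper evaluate action prediction (against BERT) and response generation (semantic similarity to ground-truth posts) independently.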