Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

📅 2026-05-27
🤖 AI Summary
This study addresses the lack of systematic validation of how faithfully LLM-driven social agents reproduce real audience behavior when generating short-text responses to news. Using the Hatemedia dataset, it evaluates the distributional consistency of five large language models, including Qwen3 and Mistral-7B, in generating Spanish-language news comments, assessed along three dimensions: hate speech prevalence, sentiment polarity, and semantic alignment. The analysis compares off-the-shelf and fine-tuned generation, showing that off-the-shelf models strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Among fine-tuned variants, Qwen3 provides the most balanced approximation, while Mistral-7B achieves the strongest sentiment and semantic alignment but overshoots hate speech prevalence. These findings show that plausible-looking synthetic replies do not necessarily reproduce the distribution of public discourse.
📝 Abstract
LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral-7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.
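To make the idea of distributional comparison concrete, the following is a minimal sketch of how real and synthetic reactions might be compared once each reply has been labeled. It is not the paper's evaluation pipeline: the abstract does not name its metrics, so the use of Jensen-Shannon divergence for sentiment and a simple prevalence difference for hate speech are assumptions, as are the function names and the toy labels.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Return the share of each category in `labels`, in `categories` order."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def prevalence_gap(real_flags, synth_flags):
    """Synthetic minus real hate-speech rate; negative means underproduction."""
    real_rate = sum(real_flags) / len(real_flags)
    synth_rate = sum(synth_flags) / len(synth_flags)
    return synth_rate - real_rate


# Toy labels for illustration only (not data from the paper).
categories = ["negative", "neutral", "positive"]
real_sentiment = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1
synth_sentiment = ["negative"] * 2 + ["neutral"] * 3 + ["positive"] * 5

real_dist = label_distribution(real_sentiment, categories)
synth_dist = label_distribution(synth_sentiment, categories)
sentiment_distance = js_divergence(real_dist, synth_dist)

# 1 marks a reply flagged as hate speech by some classifier (hypothetical).
hate_gap = prevalence_gap([1, 1, 0, 0], [0, 0, 0, 1])
```

A negative `hate_gap` corresponds to the underproduction of hate speech reported for off-the-shelf models, while a positive one corresponds to the overshoot reported for fine-tuned Mistral-7B. Both quantities compare whole populations of replies, which is why individually plausible synthetic replies can still score poorly.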
Problem

Research questions and friction points this paper is trying to address.

LLM-powered social agents
realism evaluation
online news reactions
audience discourse
synthetic text fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-powered social agents
realism evaluation
synthetic audience reactions
hate speech modeling
distributional fidelity