🤖 AI Summary
This study addresses the lack of systematic validation of how realistically LLM-driven social agents reproduce short-text reactions to news. Leveraging the Hatemedia dataset, it evaluates distributional consistency across five large language models (including Qwen3 and Mistral-7B) in generating Spanish-language news comments, assessed along three dimensions: hate speech prevalence, sentiment polarity, and semantic alignment. The analysis compares off-the-shelf and fine-tuned generation, revealing that off-the-shelf models strongly underproduce hate speech and introduce model-specific sentiment biases. Among fine-tuned variants, Qwen3 achieves the most balanced performance, while Mistral-7B is strongest in sentiment and semantic alignment yet overshoots hate speech prevalence. These findings highlight the limitations of both general-purpose and adapted models in reproducing the distribution of public discourse.
📝 Abstract
LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news.
In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation.
Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral-7B achieves the strongest sentiment and semantic alignment but overshoots hate speech prevalence. Overall, plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.
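The abstract does not state which metrics are used to compare real and synthetic reactions. As a rough illustration of what a distribution-level comparison can look like, the sketch below assumes each reaction has already been labeled by some classifier (a binary hate flag and a sentiment category) and computes a prevalence gap and a Jensen-Shannon divergence. The function names, label sets, and toy inputs are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Relative frequency of each category, in the order given by `categories`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def prevalence_gap(real_flags, synth_flags):
    """Synthetic minus real hate prevalence; negative means underproduction."""
    real_rate = sum(real_flags) / len(real_flags)
    synth_rate = sum(synth_flags) / len(synth_flags)
    return synth_rate - real_rate


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    categories = ["negative", "neutral", "positive"]
    real_sentiment = ["negative", "negative", "neutral", "positive"]
    synth_sentiment = ["neutral", "neutral", "positive", "positive"]

    p = label_distribution(real_sentiment, categories)
    q = label_distribution(synth_sentiment, categories)
    print("sentiment JS divergence:", js_divergence(p, q))
    print("hate prevalence gap:", prevalence_gap([1, 0, 0, 0], [0, 0, 0, 0]))
```

Comparing label distributions rather than individual replies reflects the abstract's closing point: a reply can read as plausible on its own while the aggregate distribution still diverges from real discourse.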