🤖 AI Summary
Understanding what content is valued across different online communities is crucial for social computing tasks such as recommendation, moderation, and ranking. This work proposes VASTU, a standardized, cross-community benchmark that enables methodologically comparable evaluation, comprising 75,000 comments from 15 Reddit communities annotated with community endorsement labels and linguistic features. Through a systematic evaluation of feature-engineered models, transformers, and large language models (LLMs), covering both prompting and fine-tuning, under global and community-specific training paradigms, the study finds that community-specific models substantially outperform generic ones. Fine-tuned transformers achieve the best performance (AUROC = 0.72), smaller fine-tuned models (0.65) surpass prompted LLMs (0.60), and reasoning-based models perform worst (0.53), challenging assumptions about the efficacy of LLM reasoning for this task.
📝 Abstract
Detecting what content communities value is a foundational challenge for social computing systems -- from feed curation and content ranking to moderation tools and personalized recommendation systems. Yet existing approaches remain fragmented across methodological paradigms, and it remains unclear which methods best capture community-specific notions of value. We introduce VASTU (Value-Aligned Social Toolkit for Online Content Curation), a benchmark and evaluation framework for systematically comparing approaches to detecting community-valued content. VASTU includes a dataset of 75,000 comments from 15 diverse Reddit communities, annotated with community approval labels and rich linguistic features. Using VASTU, we evaluate feature-based models, transformers, and both prompted and fine-tuned language models under global versus community-specific training regimes. We find that community-specific models consistently outperform global approaches, with fine-tuned transformers achieving the strongest performance (0.72 AUROC). Notably, fine-tuned small language models (SLMs; 0.65 AUROC) substantially outperform prompted LLMs (0.60 AUROC) despite being 100 times smaller. Counterintuitively, chain-of-thought prompting provides no benefit, and reasoning models perform the worst (0.53 AUROC), suggesting this task requires learning community norms rather than test-time reasoning. By releasing VASTU, we provide a standardized benchmark to advance research on value-aligned sociotechnical systems.
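The abstract's central comparison -- community-specific versus global evaluation under AUROC -- can be illustrated with a small sketch. The community names, labels, and scores below are toy data, not the paper's dataset or results; the AUROC implementation is the standard rank-based definition (probability that a random positive is scored above a random negative). The sketch shows how score distributions that are perfectly ranked within each community can still lose accuracy when pooled globally, one motivation for community-specific modeling.

```python
# Illustrative sketch only: per-community vs pooled ("global") AUROC on toy data.
# All community names, labels, and scores are hypothetical.

def auroc(labels, scores):
    """Rank-based AUROC: fraction of (positive, negative) pairs where the
    positive is scored higher, counting ties as half."""
    pairs = 0
    correct = 0.0
    for yp, sp in zip(labels, scores):
        if yp != 1:
            continue
        for yn, sn in zip(labels, scores):
            if yn != 0:
                continue
            pairs += 1
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / pairs if pairs else float("nan")

# Toy predictions: each community is perfectly ranked internally, but the two
# communities' score ranges differ (a calibration mismatch across communities).
data = {
    "r/science":       ([1, 0, 1, 0], [0.90, 0.20, 0.70, 0.40]),
    "r/AskHistorians": ([1, 0, 1, 0], [0.35, 0.30, 0.45, 0.10]),
}

# Community-specific evaluation: AUROC computed within each community.
per_community = {name: auroc(y, s) for name, (y, s) in data.items()}

# Global evaluation: all communities' predictions pooled into one ranking.
all_labels = [y for ys, _ in data.values() for y in ys]
all_scores = [s for _, ss in data.values() for s in ss]
pooled = auroc(all_labels, all_scores)

print(per_community)  # both communities score 1.0 here
print(pooled)         # pooled score drops below 1.0
```

Here both communities achieve a perfect within-community AUROC of 1.0, yet the pooled AUROC falls to 0.9375, because a mid-range r/science negative outranks a low-range r/AskHistorians positive once the scores are mixed.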