A Decade of News Forum Interactions: Threaded Conversations, Signed Votes, and Topical Tags

📅 2025-06-27
🤖 AI Summary
This study addresses the scarcity of long-term, large-scale, privacy-compliant online user interaction data for medium-resource languages, specifically German. To this end, we construct an anonymized dataset from the comment forum of DerStandard, a major Austrian German-language newspaper, spanning ten years (2013-2022) and comprising over 75 million comments and more than 400 million votes. It is the first publicly released resource to unify decade-long structured conversation trees, explicit upvote/downvote annotations, edit histories, and topic tags. User and comment identifiers are persistently anonymized via salted hashing, and raw comment texts are withheld; in their place, comment embeddings generated with a state-of-the-art transformer-based model are released. Designed to meet core computational social science requirements, including analysis of discussion dynamics, social network evolution, and semantic change, the dataset adheres strictly to GDPR privacy principles. It constitutes the first high-quality, reusable, longitudinal benchmark for studying German-language online public discourse.

📝 Abstract
We present a large-scale, longitudinal dataset capturing user activity on the online platform of DerStandard, a major Austrian newspaper. The dataset spans ten years (2013-2022) and includes over 75 million user comments, more than 400 million votes, and detailed metadata on articles and user interactions. It provides structured conversation threads, explicit up- and downvotes of user comments and editorial topic labels, enabling rich analyses of online discourse while preserving user privacy. To ensure this privacy, all persistent identifiers are anonymized using salted hash functions, and the raw comment texts are not publicly shared. Instead, we release pre-computed vector representations derived from a state-of-the-art embedding model. The dataset supports research on discussion dynamics, network structures, and semantic analyses in the mid-resourced language German, offering a reusable resource across computational social science and related fields.
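
As an illustration of the anonymization step described above, the following is a minimal sketch of salted hashing for persistent identifiers. The abstract does not name the hash function, the salt handling, or the identifier format, so SHA-256, the 32-byte random salt, and the `user_...` identifiers here are all assumptions chosen for illustration.

```python
import hashlib
import secrets


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Return a stable pseudonym: SHA-256 over the secret salt plus the identifier.

    Assumption: the dataset's actual hash function and salt scheme are not
    specified in the abstract; SHA-256 is used here only as an example.
    """
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


# Hypothetical secret salt; it would be kept private and never released,
# so that pseudonyms cannot be recomputed from known identifiers.
salt = secrets.token_bytes(32)

# The same identifier always maps to the same pseudonym (persistent),
# while different identifiers map to different pseudonyms.
alias_a = pseudonymize("user_12345", salt)
alias_a_again = pseudonymize("user_12345", salt)
alias_b = pseudonymize("user_67890", salt)
```

Because the mapping is deterministic for a fixed salt, a user's activity can still be linked across the ten-year span without exposing the original identifier.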
Problem

Research questions and friction points this paper is trying to address.

Analyze online discourse dynamics using threaded conversations and votes
Study network structures and perform semantic analyses in German, a mid-resourced language
Provide an anonymized dataset for computational social science research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persistent identifiers anonymized with salted hash functions
Pre-computed vector representations from an embedding model, released in place of raw comment texts
Structured threads with votes and topic labels
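
To make the idea of structured threads with signed votes concrete, here is a minimal sketch that rebuilds a reply tree from flat comment records and computes a net vote score. The field names (`comment_id`, `parent_id`, `upvotes`, `downvotes`) and the sample records are hypothetical; the released dataset's actual schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    comment_id: str           # hypothetical field: pseudonymized comment identifier
    parent_id: Optional[str]  # hypothetical field: None marks a top-level comment
    upvotes: int
    downvotes: int


def build_children(comments):
    """Map each parent id (None for the article root) to its direct replies."""
    children = defaultdict(list)
    for comment in comments:
        children[comment.parent_id].append(comment.comment_id)
    return children


def thread_depth(children, root=None):
    """Length of the longest reply chain below root; top-level comments count as 1."""
    stack = [(cid, 1) for cid in children.get(root, [])]
    deepest = 0
    while stack:
        cid, depth = stack.pop()
        deepest = max(deepest, depth)
        stack.extend((child, depth + 1) for child in children.get(cid, []))
    return deepest


def net_score(comment):
    """Signed vote balance of a comment: upvotes minus downvotes."""
    return comment.upvotes - comment.downvotes


# Made-up sample records for one article's discussion.
comments = [
    Comment("a", None, 5, 1),
    Comment("b", "a", 2, 3),
    Comment("c", "b", 0, 0),
    Comment("d", None, 1, 0),
]
children = build_children(comments)
depth = thread_depth(children)
scores = {c.comment_id: net_score(c) for c in comments}
```

The iterative traversal avoids recursion limits on very long reply chains, which may matter for a corpus of this size.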