Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting

📅 2025-08-26
🤖 AI Summary
Existing toxic speech datasets largely lack fine-grained demographic annotations, particularly age, which hinders intergenerational analysis of linguistic behavior. To address this, we introduce the first large-scale, age-annotated German toxic comment dataset, drawn from Instagram, TikTok, and YouTube. The methodology combines platform-provided age estimates with collaborative human and LLM annotation (3,024 human-annotated and 30,024 LLM-annotated comments), covering categories including insults, disinformation, and criticism of public broadcasting fees. Analyses reveal generational patterns: younger users favor more expressive language, whereas older users more often engage in disinformation and devaluation. With 16.7% of comments labeled as problematic, this dataset offers a resource for studying linguistic variation across demographics and for developing more equitable, age-aware content moderation systems.

📝 Abstract
A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
Problem

Research questions and friction points this paper is trying to address.

Lack of demographic context in existing toxic speech datasets
Need to study age-based differences in online communication
Develop equitable, age-aware content moderation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale German dataset with toxicity and age annotations
Combined human expertise with state-of-the-art language models
Revealed age-based differences in toxic speech patterns
Jan Fillies
Freie Universität Berlin
Michael Peter Hoffmann
Freie Universität Berlin
Rebecca Reichel
MSB Medical School Berlin
Roman Salzwedel
funk - Content-Netzwerk
Sven Bodemer
funk - Content-Netzwerk
Adrian Paschke
Professor, Computer Science, Freie Universität Berlin
Corporate Semantic Web, Machine Learning, Artificial Intelligence, Data Analytics, Semantic Technologies