🤖 AI Summary
Online community content moderation policies, such as those of Reddit subreddits, are often implicit, ambiguous, and poorly documented, which hinders cross-community comparison. To address this, we propose an interpretable lexicon-scoring model that automatically infers latent moderation rules from historical moderation logs. Our approach integrates lexical pattern analysis with an inherently interpretable machine learning architecture, mapping moderation decisions onto a term-weight matrix that reflects statistically significant associations with post removal. While matching black-box neural networks in predictive performance, the model provides transparent, auditable decision rationales. Empirical evaluation across diverse communities reveals systematic differences in linguistic tolerance, topic sensitivity, and the fine-grained subcategorization of toxic speech. Moreover, we identify several undocumented, community-specific moderation behaviors, such as differential enforcement of rules across dialects or contexts. These findings offer both a novel analytical tool and an empirical grounding for research on platform governance and norm evolution.
📝 Abstract
Effective content moderation systems require explicit classification criteria, yet online communities such as subreddits often operate with diverse, implicit standards. This work introduces a novel approach that identifies and extracts these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across communities. Our experiments demonstrate that these extracted lexical patterns match the performance of neural moderation models while providing transparent insight into the decision-making process. The resulting criteria matrix reveals significant variation in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns, including community-specific tolerances for certain language, lexical markers of topical restrictions, and latent subcategories within the broad toxic-speech label.
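To make the score-table idea concrete, here is a minimal sketch of one way such a lexicon could be built: score each term by its smoothed log-odds of appearing in removed versus kept posts, so positive scores indicate an association with removal and the whole table stays directly inspectable. This is an illustrative assumption, not the paper's actual model; the function name, smoothing scheme, and toy posts are all hypothetical.

```python
from collections import Counter
import math

def lexicon_score_table(removed_posts, kept_posts, alpha=1.0):
    """Hypothetical sketch: map each term to the smoothed log-odds of
    occurring in removed vs. kept posts. Unlike a black-box classifier,
    the resulting table can be read, audited, and compared across
    communities term by term."""
    removed = Counter(t for p in removed_posts for t in p.lower().split())
    kept = Counter(t for p in kept_posts for t in p.lower().split())
    vocab = set(removed) | set(kept)
    n_rem = sum(removed.values())
    n_kep = sum(kept.values())
    table = {}
    for term in vocab:
        # Laplace-style smoothing so unseen terms get finite scores.
        p_rem = (removed[term] + alpha) / (n_rem + alpha * len(vocab))
        p_kep = (kept[term] + alpha) / (n_kep + alpha * len(vocab))
        table[term] = math.log(p_rem / p_kep)
    return table

# Toy, made-up posts for illustration only.
removed = ["you are an idiot", "spam spam buy now"]
kept = ["great discussion thanks", "interesting point about policy"]
scores = lexicon_score_table(removed, kept)
# Terms concentrated in removed posts receive positive scores,
# e.g. scores["spam"] > 0, while scores["thanks"] < 0.
```

Comparing such tables fitted per community would surface the kind of divergent enforcement the abstract describes, since the same term can carry very different weights in different subreddits.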