Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically audits Twitch's AutoMod system for detecting hate speech, including racism, misogyny, homophobia, and ableism. Using streaming accounts as siloed, API-driven test beds, we injected over 107,000 comments, collated from four datasets, into live chats. Applying black-box functional auditing and controlled comparative experiments, we identify AutoMod's core limitation: heavy reliance on lexical matching rather than semantic understanding. Up to 94% of blatantly hateful messages bypass moderation on some datasets, while adding slurs raises removal to 100%, confirming keyword dependency. Conversely, benign usages, such as pedagogical or empowering references to sensitive terms, are wrongly blocked at rates of up to 89.5%. Our findings provide the first large-scale empirical evidence of systemic inaccuracy in platform-level automated moderation, directly exposing critical deficiencies in contextual modeling. This work underscores the urgent need to shift from static lexicon-based filtering toward semantics-aware, context-sensitive moderation paradigms.
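The keyword-dependence finding rests on a controlled comparison: a hateful message that bypassed moderation is resent with a slur added, and the two moderation outcomes are compared. Below is a hypothetical sketch of that pairing, where `send_and_check` and the slur value stand in for the paper's actual harness and lexicon, which are not reproduced here.

```python
# Hypothetical controlled comparison isolating slur dependence.
# `send_and_check(msg)` is a placeholder that returns True if AutoMod
# removed/held the message; `slur` is a placeholder for an entry from
# a slur lexicon (omitted here).

def slur_flip_rate(bypassing_messages, slur, send_and_check):
    """For hateful messages that bypassed AutoMod, measure how often
    adding a slur flips the moderation decision to removal."""
    flipped = 0
    for msg in bypassing_messages:
        base_removed = send_and_check(msg)               # expected: False (it bypassed)
        perturbed_removed = send_and_check(f"{slur} {msg}")
        if perturbed_removed and not base_removed:
            flipped += 1
    return flipped / len(bypassing_messages)             # 1.0 => pure keyword matching
```

A flip rate near 1.0, as the audit reports, indicates the filter keys on the lexical item rather than on the message's meaning.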

📝 Abstract
To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement (e.g., users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch's automated moderation tool (AutoMod) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch's APIs to send over 107,000 comments collated from 4 datasets. We measure AutoMod's accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to 94% on some datasets, bypass moderation. Contextual addition of slurs to these messages results in 100% removal, revealing AutoMod's reliance on slurs as a moderation signal. We also find that contrary to Twitch's community guidelines, AutoMod blocks up to 89.5% of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in AutoMod's capabilities and underscores the importance for such systems to understand context effectively.
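To make the audit setup concrete, here is a minimal sketch of the injection side of such an experiment, assuming Twitch's standard IRC chat interface; the token, account, channel, and comment file names are placeholders, not values from the paper.

```python
# Minimal sketch of the injection side of the audit, assuming Twitch's
# standard IRC chat interface. TOKEN, NICK, CHANNEL, and the comment
# file are placeholders.
import socket
import time

HOST, PORT = "irc.chat.twitch.tv", 6667  # Twitch IRC chat endpoint (plain TCP)
TOKEN = "oauth:..."                      # OAuth token for the test account (placeholder)
NICK = "audit_bot"                       # hypothetical test-account name
CHANNEL = "#audit_channel"               # hypothetical siloed test channel

with socket.create_connection((HOST, PORT)) as sock:
    # Authenticate and join the siloed test channel.
    sock.sendall(f"PASS {TOKEN}\r\n".encode("utf-8"))
    sock.sendall(f"NICK {NICK}\r\n".encode("utf-8"))
    sock.sendall(f"JOIN {CHANNEL}\r\n".encode("utf-8"))

    # Send each collated test comment into the live chat.
    with open("test_comments.txt", encoding="utf-8") as f:
        for line in f:
            sock.sendall(f"PRIVMSG {CHANNEL} :{line.strip()}\r\n".encode("utf-8"))
            time.sleep(1.5)  # stay under Twitch's per-account chat rate limits
```

Only the sending half is shown; determining whether AutoMod held or removed a given comment would need a second, reader-side connection (or Twitch's moderation event feeds) to observe which messages actually surface in chat.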
Problem

Research questions and friction points this paper is trying to address.

Auditing the effectiveness of Twitch's AutoMod in detecting hate speech
Measuring AutoMod's accuracy in flagging misogyny, racism, ableism, and homophobia (error-rate tallying sketched after this list)
Identifying AutoMod's over-reliance on slurs and context-blind moderation
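Given each comment's ground-truth label and AutoMod's observed decision, the two headline error rates reduce to simple tallies. A hypothetical sketch follows; the `label` and `flagged` field names are illustrative, not from the paper.

```python
# Hypothetical bookkeeping for the audit's two headline metrics:
# - false-negative rate: hateful comments that bypass AutoMod
# - false-positive rate: benign comments that AutoMod wrongly blocks
# Each record pairs a ground-truth label with AutoMod's observed decision.

def audit_error_rates(results):
    hateful = [r for r in results if r["label"] == "hateful"]
    benign = [r for r in results if r["label"] == "benign"]
    false_negative_rate = sum(not r["flagged"] for r in hateful) / len(hateful)
    false_positive_rate = sum(r["flagged"] for r in benign) / len(benign)
    return false_negative_rate, false_positive_rate

# A 94% bypass rate on hateful content and an 89.5% block rate on benign
# pedagogical/empowering uses would surface here as (0.94, 0.895).
```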
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auditing Twitch's AutoMod for hate speech
Sending over 107,000 comments via Twitch's APIs
Revealing AutoMod's reliance on slur detection
Prarabdh Shukla
Indian Institute of Science
Machine Learning, Natural Language Processing, Trustworthy AI

Wei Yin Chong
University of Chicago

Yash Patel
Indian Institute of Science

Brennan Schaffner
University of Chicago

Danish Pruthi
Indian Institute of Science, Bangalore
Natural Language Processing, Deep Learning, Interpretability

A. Bhagoji
University of Chicago, Indian Institute of Technology, Bombay