🤖 AI Summary
Hate speech dataset construction faces methodological trade-offs, with prevailing practices often compromising reliability for operational convenience.
Method: We conduct a cross-dataset qualitative meta-analysis and methodological critique, systematically identifying twelve recurrent pitfalls. We develop a reproducible evaluation matrix spanning annotation transparency, value positioning, and contextual modelling. Drawing on Max Weber's theory of "ideal types", we propose a novel three-dimensional framework—value awareness, transparent annotation, and meta-methodological reflection—that integrates classical sociological theory into computational social science data methodology for the first time.
Contribution/Results: This work shifts hate speech research from empirically driven practice toward reflexive scholarship, enhancing dataset rigour, interpretability, and ethical accountability. The framework provides actionable guidance for constructing socially responsible, theoretically grounded, and methodologically transparent hate speech datasets.
📝 Abstract
The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices across a diverse range of datasets, highlighting common themes and practices and their implications for dataset reliability. Drawing on Max Weber's notion of ideal types, we argue for a reflexive approach to dataset creation, urging researchers to acknowledge their own value judgements during construction and thereby foster transparency and methodological rigour.