AI Summary
Hate speech detection models often deviate from human-defined criteria, and evaluation relies excessively on single-metric accuracy, lacking rigorous alignment with fine-grained definitions. Method: We propose DefVerify, a three-stage verification framework: (1) formalizing user-provided definitions into logical rules and semantic embeddings; (2) quantifying consistency between model predictions and these definitions; and (3) attributing inconsistencies to either annotation bias or modeling errors. Contribution/Results: DefVerify is the first systematic approach enabling definition-aware, interpretable behavioral verification of hate speech models. Evaluated on six mainstream benchmarks, it reveals substantial "definition-behavior gaps," identifying annotation inconsistency and erroneous generalization as primary failure modes. The framework shifts hate speech evaluation from black-box accuracy toward definition-consistency validation, supporting transparent diagnosis and workflow-level attribution.
Abstract
When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We use DefVerify to find gaps between definition and model behavior when applied to six popular hate speech benchmark datasets.